
Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Adapted by Łukasz Stafiniak from the book by Marcus Hutter.

Table of contents

Universal Sequence Prediction
  Epicurus, Ockham, Hume, Bayes, Solomonoff
  Algorithmic Information Theory and Probability
  Convergence, Error and Loss Bounds
  Convergence
  Error bounds
  Loss bounds
  Games of Chance
  Optimality

The Universal Algorithmic Agent AIXI
  Agents in Known Probabilistic Environments
  AIµ model in Functional Form
  AIµ model in Recursive and Iterative Form
  Factorizable environments µ
  Probabilistic policies
  Persistent Turing Machines (reference PTMs)
  The Universal Algorithmic Agent AIXI
  Intelligence Order Relation
  Separability concepts
  Value-Related Optimality, Discounted Future Value Function
  The Choice of the Horizon
  Actions as random variables
  Uniform mixture of MDPs

Important Environmental Classes
  Sequence Prediction
  Strategic Games
  Using the AIξ model for game playing
  Function Minimization
  Supervised Learning from Examples (EX)

AIXItl and Optimal Search
  The Fastest and Shortest Algorithm for All Problems
  Levin Search
  The Fast Algorithm M_{p*}^ε
  Time-Bounded AIXI Model
  Time-Limited Probability Distributions
  The Best Vote Algorithm
  The Universal Time-Bounded AIXItl Agent
  Optimality of AIXItl

Speed Prior and OOPS (reference TheNewAI)
  Speed Prior
  Optimal Ordered Problem Solver

Goedel Machine (reference GoedelMachines)
  A program that prints itself
  Proof Techniques
  Possible Types of Goedel Machine Self-improvements

Universal Sequence Prediction

Epicurus, Ockham, Hume, Bayes, Solomonoff

- Epicurus' principle of multiple explanations
- Occam's razor (simplicity) principle
- Hume's negation of induction
- Bayes' rule for conditional probabilities
- Solomonoff's universal theory of inductive inference

Induction here = reasoning about the future from past experience. Prequential approach (transductive inference) = predictions without building a model. Every induction problem can be phrased as a sequence prediction task. Classification is a special case of sequence prediction. We are interested in maximizing profit / minimizing loss. Separating noise from data is not necessary.

Algorithmic Information Theory and Probability

A prefix code (prefix-free set of strings P) satisfies the Kraft inequality: ∑_{x∈P} 2^{-l(x)} ≤ 1.

Kolmogorov complexity: K(x) := min_p {l(p): U(p) = x},  K(x|y) := min_p {l(p): U(y,p) = x}.

Properties:
- K(x) ≤ l(x) + 2 log₂ l(x) + O(1)
- K(x, y) = K(y | x, K(x)) + K(x) + O(1)
- K(x) ≤ −log₂ P(x) + K(P) + O(1), if P: B* → [0,1] is enumerable and ∑_x P(x) ≤ 1

Bayes rule: p(H_i | D) = p(D | H_i) p(H_i) / ∑_{i∈I} p(D | H_i) p(H_i)

Kolmogorov complexity is only co-enumerable = upper semi-computable.
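As a quick numerical sanity check of the Kraft inequality and of the Shannon-Fano flavour of the bound K(x) ≲ −log₂ P(x) + K(P), here is a small Python sketch (the example code and distribution are my own illustration, not from the slides):

    # Toy check: Kraft inequality for a prefix-free code, and that code lengths
    # l(x) = ceil(-log2 P(x)) satisfy it automatically (Shannon-Fano lengths).
    import math

    def kraft_sum(code):
        """code: dict symbol -> binary codeword (string of '0'/'1')."""
        return sum(2.0 ** -len(w) for w in code.values())

    def is_prefix_free(code):
        words = list(code.values())
        return not any(a != b and b.startswith(a) for a in words for b in words)

    code = {"a": "0", "b": "10", "c": "110", "d": "111"}
    assert is_prefix_free(code)
    print("Kraft sum:", kraft_sum(code))        # 1.0 for a complete prefix code

    P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}
    print("sum 2^-l(x):", sum(2.0 ** -l for l in lengths.values()))   # <= 1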

Universal prior: the probability that the output of a universal monotone TM starts with x when provided with fair coin flips on the input tape:

M(x) := ∑_{p: U(p) = x*} 2^{-l(p)}

µ ≥ 0 is a semimeasure if µ(ε) ≤ 1 and µ(x) ≥ µ(x0) + µ(x1) (a probability measure if equality holds).

Universality of M: M multiplicatively dominates all enumerable semimeasures,

M(x) ≥ 2^{-K(ρ)} ρ(x)  (up to a multiplicative constant),

where ρ is any enumerable semimeasure. M is enumerable but not estimable. Conditioning on a string:

M(y | x) := M(xy)/M(x) ≥ 2^{-K(y|x)}  (within multiplicative constants).

Try to predict the continuation x_n ∈ B of a given sequence x_1 ... x_{n-1}.

∑_{t=1}^∞ (1 − M(x_t | x_{<t}))² ≤ −(1/2) ∑_{t=1}^∞ ln M(x_t | x_{<t}) = −(1/2) ln M(x_{1:∞}) ≤ (ln 2 / 2) Km(x_{1:∞})

If x_{1:∞} is computable, then Km(x_{1:∞}) < ∞, and M(x_t | x_{<t}) → 1.

Assume now the true sequence is drawn from a computable probability distribution µ. The probability of x_n given x_{<n} is µ(x_n | x_{<n}) = µ(x_{1:n}) / µ(x_{<n}). Posterior convergence of M to µ:

∑_{t=1}^∞ ∑_{x_{<t}∈B^{t-1}} µ(x_{<t}) (M(0 | x_{<t}) − µ(0 | x_{<t}))² ≤ (ln 2 / 2) K(µ) + O(1) < ∞

That is, M(0 | x_{<t}) − µ(0 | x_{<t}) tends to zero with µ-probability 1. We will see a proof later, by approximating M with

ξ(x) := ∑_{ν∈M} 2^{-K(ν)} ν(x)
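The mixture ξ(x) = ∑_ν w_ν ν(x) can be simulated for a small finite class. The sketch below is my own toy example (an evenly weighted Bernoulli class standing in for M); it shows the posterior predictive ξ(1 | x_{<t}) converging to the true µ:

    import random

    random.seed(0)
    thetas  = [i / 10 for i in range(11)]                 # toy model class M
    weights = {th: 1.0 / len(thetas) for th in thetas}    # prior weights w_nu
    mu = 0.7                                              # true environment in M

    for t in range(1, 2001):
        x = 1 if random.random() < mu else 0
        pred1 = sum(w * th for th, w in weights.items())  # xi(x_t = 1 | x_<t)
        for th in thetas:                                 # Bayes update of w_nu(x_1:t)
            weights[th] *= th if x == 1 else (1 - th)
        z = sum(weights.values())
        for th in thetas:
            weights[th] /= z
        if t % 500 == 0:
            print(t, round(pred1, 3), "vs mu(1|.) =", mu)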

A sequence is µ-Martin-Löf random (µ.M.L.) iff: ∃c ∀n: M(x_{1:n}) ≤ c µ(x_{1:n}). It is µ/ξ-Martin-Löf random (µ.ξ.r.) iff: ∃c ∀n: ξ(x_{1:n}) ≤ c µ(x_{1:n}). A theorem true for all µ-M.L. random sequences is true with µ-probability 1.

Complexity increase, where K* ∈ {K, C}:
- K(yx) ≤ K(y) + K*(x|y) + O(1)   (prefix Kolmogorov complexity)
- C(yx) ≤ C(y) + K*(x|y) + O(K(C(y)))   (plain Kolmogorov complexity)
- KM(yx) ≤ KM(y) + K*(x|y) + O(K(l(y))),  where KM(x) := −log₂ M(x)
- KM(yx) ≤ KM(y) + K*(µ|y) − log₂ µ(x|y) + O(K(l(y)))

A predictor based on K fails, due to K(x1) = K(x0) + O(1). The monotone complexity Km(x) := min_p {l(p): U(p) = x*} does not suffer from this. m(x) := 2^{-Km(x)} is extremely close to M(x).

- m converges on-sequence rapidly: ∏_{t=1}^n m(x_t | x_{<t}) ≥ 2^{-Km(x_{1:n})}, and m(x_t | x_{<t}) ≠ 1 at most Km(x_{1:∞}) times.
- m may converge slowly off-sequence: for some U and x_{1:∞} with Km(x_{1:∞}) = s, the off-sequence predictions m(x̄_t | x_{<t}) can stay large for on the order of 2^s steps.
- m may not converge for probabilistic environments: ∃µ ∈ M_comp \ M_det and x_{1:∞} with m(x_t | x_{<t}) ↛ µ(x_t | x_{<t}).

m is not a semimeasure, but normalization does not improve the above.

Convergence, Error and Loss Bounds

Assumptions: ξ is a mixture distribution, a w_ν-weighted sum of probability distributions ν from a set M containing the true distribution µ:

ξ(x_t | x_{<t}) = ∑_{ν∈M} w_ν(x_{<t}) ν(x_t | x_{<t}),   w_ν(x_{1:t}) := w_ν(x_{<t}) ν(x_t | x_{<t}) / ξ(x_t | x_{<t}),   w_ν(ε) = w_ν.

Distance measures:
- absolute (or Manhattan): a(y, z) := ∑_i |y_i − z_i|
- quadratic (or squared Euclidean): s(y, z) := ∑_i (y_i − z_i)²
- (squared) Hellinger distance: h(y, z) := ∑_i (√y_i − √z_i)²
- relative entropy or KL divergence: d(y, z) := ∑_i y_i ln(y_i / z_i)
- absolute divergence: b(y, z) := ∑_i y_i |ln(y_i / z_i)|

Entropy inequalities: s ≤ a² ≤ 2d (Pinsker-type), and h ≤ d ≤ b.
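A small Python sketch (illustrative only; the two distributions are made up) computing these distances and numerically checking the inequalities listed above:

    import math

    def a(y, z):  return sum(abs(yi - zi) for yi, zi in zip(y, z))            # absolute
    def s(y, z):  return sum((yi - zi) ** 2 for yi, zi in zip(y, z))          # squared
    def h(y, z):  return sum((math.sqrt(yi) - math.sqrt(zi)) ** 2 for yi, zi in zip(y, z))
    def d(y, z):  return sum(yi * math.log(yi / zi) for yi, zi in zip(y, z) if yi > 0)   # KL
    def b(y, z):  return sum(yi * abs(math.log(yi / zi)) for yi, zi in zip(y, z) if yi > 0)

    y = [0.6, 0.3, 0.1]
    z = [0.4, 0.4, 0.2]
    print(a(y, z), s(y, z), h(y, z), d(y, z), b(y, z))
    assert s(y, z) <= a(y, z) ** 2 <= 2 * d(y, z) + 1e-12       # s <= a^2 <= 2d
    assert h(y, z) <= d(y, z) + 1e-12 and d(y, z) <= b(y, z) + 1e-12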

Instantaneous (at time t) and total distances between ξ and µ. With X = {1, ..., N}, N = |X|, i = x_t, y_i = µ(x_t | x_{<t}), z_i = ξ(x_t | x_{<t}):

a_t(x_{<t}) := ∑_{x_t} |µ(x_t|x_{<t}) − ξ(x_t|x_{<t})|,   A_n := ∑_{t=1}^n E[a_t(x_{<t})]
s_t(x_{<t}) := ∑_{x_t} (µ(x_t|x_{<t}) − ξ(x_t|x_{<t}))²,   S_n := ∑_{t=1}^n E[s_t(x_{<t})]
h_t(x_{<t}) := ∑_{x_t} (√µ(x_t|x_{<t}) − √ξ(x_t|x_{<t}))²,   H_n := ∑_{t=1}^n E[h_t(x_{<t})]
d_t(x_{<t}) := ∑_{x_t} µ(x_t|x_{<t}) ln(µ(x_t|x_{<t})/ξ(x_t|x_{<t})),   D_n := ∑_{t=1}^n E[d_t(x_{<t})]
b_t(x_{<t}) := ∑_{x_t} µ(x_t|x_{<t}) |ln(µ(x_t|x_{<t})/ξ(x_t|x_{<t}))|,   B_n := ∑_{t=1}^n E[b_t(x_{<t})]

For example, the first convergence result on the next page says:

∑_{t=1}^n E[∑_{x_t} (µ(x_t|x_{<t}) − ξ(x_t|x_{<t}))²] ≤ ln w_µ^{-1}

Convergence

Convergence of ξ to µ:

S_n ≤ D_n ≤ ln w_µ^{-1} < ∞
s_t(x_{<t}) ≤ d_t(x_{<t}) → 0 w.µ.p.1
ξ(x_t | x_{<t}) − µ(x_t | x_{<t}) → 0 w.µ.p.1 (and i.m.s.) for any x_t

E[∑_{t=1}^n (√(ξ(x_t|x_{<t}) / µ(x_t|x_{<t})) − 1)²] ≤ H_n ≤ D_n ≤ ln w_µ^{-1} < ∞
ξ(x_t|x_{<t}) / µ(x_t|x_{<t}) → 1 w.µ.p.1 and i.m.s.

b_t(x_{<t}) ≤ d_t(x_{<t}) + √(2 d_t(x_{<t})),   a_t(x_{<t}) ≤ √(2 d_t(x_{<t})),   B_n ≤ D_n + √(2 n D_n),   A_n ≤ √(2 n D_n)

where w_µ is the weight of µ in ξ and x_{1:∞} is an arbitrary (nonrandom) sequence.

µ/ξ-randomness cannot be decided from ξ being a mixture distribution and the dominance property alone. (E.g., for Bernoulli sequences, it is related to denseness of M_Θ.)

When µ ∉ M, but there is µ̂ ∈ M with KL divergence

D_n(µ‖µ̂) := ∑_{x_{1:n}} µ(x_{1:n}) ln(µ(x_{1:n}) / µ̂(x_{1:n})) ≤ c,  then

D_n = E[ln(µ(x_{1:n})/ξ(x_{1:n}))] = E[ln(µ̂(x_{1:n})/ξ(x_{1:n}))] + E[ln(µ(x_{1:n})/µ̂(x_{1:n}))] ≤ ln w_µ̂^{-1} + c

Error bounds

A prediction scheme Θ_ρ predicts x_t^{Θρ} := argmax_{x_t} ρ(x_t | x_{<t}). Probability of making a wrong prediction and expected number of errors:

e_t^{Θρ}(x_{<t}) := 1 − µ(x_t^{Θρ} | x_{<t}),   E_n^{Θρ} := ∑_{t=1}^n E[e_t^{Θρ}(x_{<t})]

Error bound:

0 ≤ E_n^{Θξ} − E_n^{Θµ} ≤ √(2 (E_n^{Θξ} + E_n^{Θµ}) S_n) ≤ S_n + √(4 E_n^{Θµ} S_n + S_n²) ≤ 2 S_n + 2√(E_n^{Θµ} S_n)

Loss bounds

Let l_{x_t y_t} ∈ R be the loss received when taking action y_t ∈ Y and x_t ∈ X is the t-th symbol of the sequence. W.l.o.g. 0 ≤ l_{x_t y_t} ≤ 1. We call an action a prediction even if X ≠ Y.

A prediction scheme Λ_ρ predicts

y_t^{Λρ} := argmin_{y_t ∈ Y} ∑_{x_t} ρ(x_t | x_{<t}) l_{x_t y_t}

The actual and total µ-expected loss:

l_t^{Λρ}(x_{<t}) := E_t[l_{x_t y_t^{Λρ}}],   L_n^{Λρ} := ∑_{t=1}^n E[l_t^{Λρ}(x_{<t})]

Unit loss bound:

0 ≤ L_n^{Λξ} − L_n^{Λµ} ≤ D_n + √(4 L_n^{Λµ} D_n + D_n²) ≤ 2 D_n + 2√(L_n^{Λµ} D_n)

Corollary:
- L_∞^{Λµ} finite ⟹ L_∞^{Λξ} finite
- L_∞^{Λξ} ≤ 2 D_∞ ≤ 2 ln w_µ^{-1} for deterministic µ, if ∀x ∃y: l_{xy} = 0
- L_n^{Λξ} / L_n^{Λµ} = 1 + O((L_n^{Λµ})^{-1/2}) → 1 if L_n^{Λµ} → ∞
- L_n^{Λξ} − L_n^{Λµ} = O(√(L_n^{Λµ}))

Let Λ be any prediction scheme: then L_n^{Λµ} ≤ L_n^{Λ}, l_t^{Λµ}(x_{<t}) ≤ l_t^{Λ}(x_{<t}), and

L_n^{Λξ} − L_n^{Λ} ≤ 2√(L_n^{Λξ} D_n),   L_n^{Λξ} / L_n^{Λ} ≤ 1 + O((L_n^{Λ})^{-1/2})
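The gap L_n^{Λξ} − L_n^{Λµ} can be observed numerically. The sketch below is a toy Bernoulli setting with 0/1 loss (my own example, not from the book): Λ_ξ uses a finite Bayes mixture, Λ_µ knows the true parameter.

    import random
    random.seed(1)

    loss = [[0.0, 1.0],            # loss matrix l[x][y] for x, y in {0, 1}
            [1.0, 0.0]]            # plain 0/1 loss as the simplest example

    mu = 0.7                                  # true Bernoulli parameter
    thetas  = [i / 10 for i in range(11)]     # mixture class standing in for M
    weights = {th: 1.0 / len(thetas) for th in thetas}

    def act(p1):                   # Lambda_rho rule: minimize rho-expected loss
        return min((0, 1), key=lambda y: (1 - p1) * loss[0][y] + p1 * loss[1][y])

    L_xi = L_mu = 0.0
    for t in range(5000):
        pred1 = sum(w * th for th, w in weights.items())   # xi(x_t = 1 | x_<t)
        y_xi, y_mu = act(pred1), act(mu)
        x = 1 if random.random() < mu else 0
        L_xi += loss[x][y_xi]
        L_mu += loss[x][y_mu]
        for th in thetas:                                  # posterior update
            weights[th] *= th if x == 1 else 1 - th
        z = sum(weights.values())
        for th in thetas:
            weights[th] /= z

    print("L_n(Lambda_xi) =", L_xi, " L_n(Lambda_mu) =", L_mu)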

l_t^{Λξ} − l_t^{Λµ} → 0 would follow from ξ → µ by continuity, but l_t^{Λξ} is in general discontinuous in ξ. Fortunately, it is continuous at ξ = µ.

Instantaneous loss bound:

∑_{t=1}^n E[(√(l_t^{Λξ}(x_{<t})) − √(l_t^{Λµ}(x_{<t})))²] ≤ 2 D_n ≤ 2 ln w_µ^{-1} < ∞

0 ≤ l_t^{Λξ}(x_{<t}) − l_t^{Λµ}(x_{<t}) ≤ ∑_{x_t} |ξ(x_t|x_{<t}) − µ(x_t|x_{<t})| ≤ √(2 d_t(x_{<t})) → 0 w.µ.p.1

0 ≤ l_t^{Λξ}(x_{<t}) − l_t^{Λµ}(x_{<t}) ≤ 2 d_t(x_{<t}) + 2√(l_t^{Λµ}(x_{<t}) d_t(x_{<t})) → 0 w.µ.p.1

The loss function could depend on time and even on the individual history; it is enough that it is bounded: l_{x_t y_t}(x_{<t}) ∈ [l_min, l_max], l_Δ := l_max − l_min.

Global loss bound:

0 ≤ L_n^{Λξ} − L_n^{Λµ} ≤ l_Δ D_n + √(4 (L_n^{Λµ} − n l_min) l_Δ D_n + l_Δ² D_n²)

Games of Chance

The profit is p_t = −l_{x_t y_t} ∈ [−p_max, p_max], the total profit P_n^{Λρ} = −L_n^{Λρ}, and the average profit per round p̄_n^{Λξ} := (1/n) P_n^{Λξ}. Time to win:

p̄_n^{Λξ} = p̄_n^{Λµ} − O(n^{-1/2}) → p̄_n^{Λµ}

n > 2 k_µ / (p̄_n^{Λµ})²  ∧  p̄_n^{Λµ} > 0  ⟹  p̄_n^{Λξ} > 0,   where w_µ = e^{-k_µ}.

Information-theoretic interpretation: that many bits about µ need to be transferred (in the worst case), paid for by the received profit. (Read from the book.)

Optimality

The prior ξ is Pareto optimal w.r.t. s_t, S_n, d_t, D_n, e_t, E_n, l_t, L_n.

Balanced Pareto optimality w.r.t. L: with the regrets Δ_ν := L_{nν}^{Λξ} − L_{nν}^{Λν}, one has ∑_{ν∈M} w_ν Δ_ν ≤ 0; a gain in some environments must be balanced by a loss in others, weighted by w_ν.

We have derived bounds for the mean squared sum, S_{nν}^{ξ_w} ≤ ln w_ν^{-1}, and for the loss regret, L_{nν}^{Λξ_w} − L_{nν}^{Λν} ≤ 2 ln w_ν^{-1} + 2√(ln w_ν^{-1} · L_{nν}^{Λν}).

Optimality of universal weights: within the set of enumerable weight functions with short program, the universal weights w_ν = 2^{-K(ν)} lead to the smallest loss bounds, within an additive constant (in ln w_µ^{-1}), in all enumerable environments. It is difficult to prove that universal weights are optimal in a stronger sense; see Exercise 3.7 in the book.

MDL / MAP estimation. For the maximum a posteriori approximator ρ(x) := max{w_ν ν(x): ν ∈ M}, or equivalently the minimum description length estimator ρ(x) := w_ν̂ ν̂(x) with ν̂ := argmin_{ν∈M} {log₂ ν(x)^{-1} + log₂ w_ν^{-1}}:

∑_{t=1}^∞ E[∑_{x_t} (µ(x_t|x_{<t}) − ρ^{(norm)}(x_t|x_{<t}))²] ≤ w_µ^{-1}

where ρ(x_t|x_{<t}) := ρ(x_{1:t})/ρ(x_{<t}) and ρ^{norm}(x_t|x_{<t}) := ρ(x_{1:t}) / ∑_{x_t} ρ(x_{1:t}). These bounds are tight; thus MDL converges i.m.s., but the convergence speed can be exponentially worse than for ξ (w_µ^{-1} instead of ln w_µ^{-1}).

Multistep predictions: for a bounded horizon h the total expected deviation is bounded in terms of h · ln w_µ^{-1}; for an arbitrary horizon one still gets convergence in the mean, but it can be slow.

Continuous probability classes: entropy bound

D_n ≡ E[ln(µ(x_{1:n})/ξ(x_{1:n}))] ≤ ln w_µ^{-1} + (d/2) ln(n/(2π)) + (1/2) ln det J̄_n(µ) + o(1)

where M = {µ_θ: θ ∈ Θ ⊆ R^d}, J̄_n is the (average) Fisher information matrix of the family of distributions, and suitable continuity conditions hold (see the book).

The Universal Algorithmic Agent AIXI

Agents in Known Probabilistic Environments

The agent model (deterministic case): agent p: X* → Y*, y_{1:k} = p(x_{<k}); environment q: Y* → X*, x_{1:k} = q(y_{1:k}); perception x_k = r_k o_k ∈ X := R × O. Action y_k is determined by a policy p depending on the I/O history y_1 x_1 ... y_{k-1} x_{k-1} ≡ yx_{<k}.

Future total reward the agent receives in cycles k to m:

V_{km}^{pq} := ∑_{i=k}^m r(x_i^{pq})

General environment case: µ. The best agent maximizes the µ-expected utility (the value function V^{pµ}):

V_{1m}^{pµ} := ∑_q µ(q) V_{1m}^{pq},   p* := argmax_p V_{1m}^{pµ}

V_{km}^{p*q} ≥ V_{km}^{pq}  for every q and every p with y_{<k}^{p*q} = y_{<k}^{pq}.

AIµ model in Functional Form

The AIµ model is the agent with policy p^µ that maximizes the µ-expected total reward r_1 + ... + r_m, i.e. p^µ := argmax_p V_{1m}^{pµ}. In cycle k the (future) value V_{km}^{pµ}(yx_{<k}) of a policy p is defined as the µ-expectation of the future reward sum r_k + ... + r_m (µ is the true, or generating, environment).

Assume the history in cycle k is ẏẋ_{<k} and let Q_k := {q: q(ẏ_{<k}) = ẋ_{<k}} be the set of all environments consistent with this history. Then:

V_{km}^{pµ}(ẏẋ_{<k}) := ∑_{q∈Q_k} µ(q) V_{km}^{pq} / ∑_{q∈Q_k} µ(q)

We generalize the finite lifetime m to a dynamic farsightedness h_k ≡ m_k − k + 1 ≥ 1, called the horizon:

p_k := argmax_{p∈P_k} V_{k m_k}^{pµ}(ẏẋ_{<k})

where P_k := {p: ∃y_k: p(ẋ_{<k}) = ẏ_{<k} y_k} is the set of policies consistent with the current history. Inserting p_k, p_{k-1}, ..., p_1 recursively yields the AIµ model: ṗ(ẋ_{<k}) := p_k(ẋ_{<k}).

For constant m we have: V_{km}^{*µ}(ẏẋ_{<k}) ≥ V_{km}^{pµ}(ẏẋ_{<k}) for all p ∈ P_k.

(In sequence prediction it was enough to maximize the next reward; here the sum of future rewards is important.)

AIµ model in Recursive and Iterative Form

Chronological probability distributions ρ(yx_{1:n}): in the book, underlined arguments are the probabilistic variables and non-underlined ones are conditions; here we write the conditionals explicitly, e.g. µ(x_k | yx_{<k} y_k) := µ(yx_{1:k}) / µ(yx_{<k} y_k) is the probability of perceiving x_k given history yx_{<k} and action y_k.

Expected reward:

V_{km}^{µ}(yx_{<k} y_k) := ∑_{x_k} [r(x_k) + V_{k+1,m}^{µ}(yx_{1:k})] µ(x_k | yx_{<k} y_k)

How p^µ chooses y_k: V_{km}^{µ}(yx_{<k}) := max_{y_k} V_{km}^{µ}(yx_{<k} y_k). Together with the induction start V_{m+1,m}^{µ}(yx_{1:m}) := 0, V_{km}^{µ} is completely defined:

V_{km}^{µ}(yx_{<k}) = max_{y_k} ∑_{x_k} [r(x_k) + V_{k+1,m}^{µ}(yx_{1:k})] µ(x_k | yx_{<k} y_k)
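A minimal sketch of this expectimax recursion for a toy known environment µ (the environment, reward and horizon below are invented purely for illustration):

    # V_km(yx_<k) = max_y sum_x [r(x) + V_{k+1,m}(yx_1:k)] mu(x_k | yx_<k, y_k)

    def mu(x, history, y):
        """Toy chronological environment: P(x_k = x | history, action y)."""
        p1 = 0.8 if y == 1 else 0.3          # action 1 makes percept 1 more likely
        return p1 if x == 1 else 1 - p1

    def reward(x):
        return x                              # percept doubles as reward here

    def V(history, k, m):
        if k > m:
            return 0.0                        # induction start V_{m+1,m} = 0
        return max(sum(mu(x, history, y) * (reward(x) + V(history + ((y, x),), k + 1, m))
                       for x in (0, 1))
                   for y in (0, 1))

    def act(history, k, m):                   # ydot_k = argmax_y V_km(history, y_k)
        return max((0, 1), key=lambda y: sum(
            mu(x, history, y) * (reward(x) + V(history + ((y, x),), k + 1, m))
            for x in (0, 1)))

    print(V((), 1, 4), act((), 1, 4))         # expected future reward, first action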

If m_k is the horizon function and ẏẋ_{<k} is the actual history in cycle k, the action of AIµ is

ẏ_k = argmax_{y_k} V_{k m_k}^{µ}(ẏẋ_{<k} y_k)

Unfolding the recursion:

ẏ_k := argmax_{y_k} ∑_{x_k} max_{y_{k+1}} ∑_{x_{k+1}} ··· max_{y_{m_k}} ∑_{x_{m_k}} (r(x_k) + ... + r(x_{m_k})) µ(x_{k:m_k} | ẏẋ_{<k} y_{k:m_k})

The value of a general policy p:

V_{km}^{pµ}(yx_{<k}) := ∑_{x_{k:m}} (r_k + ... + r_m) µ(x_{k:m} | yx_{<k} y_{k:m})  with  y_{1:m} = p(x_{<m})

Equivalence of the Functional and Explicit AI Model:

µ(yx_{1:k}) = ∑_{q: q(y_{1:k}) = x_{1:k}} µ(q)

Factorizable environments µ. Assume that the cycles are grouped into independent episodes:

µ(yx_{1:n}) = ∏_{r=0}^{s-1} µ_r(yx_{n_r+1:n_{r+1}})

Then ẏ_k depends on µ_r and on the x and y of episode r only:

ẏ_k = argmax_{y_k} ∑_{x_k} ··· max_{y_t} ∑_{x_t} (r(x_k) + ... + r(x_t)) µ_r(x_{k:t} | ẏẋ_{n_r+1:k-1} y_{k:t})

with t := min{m_k, n_{r+1}}.

Probabilistic policies. For a policy π:

V_µ^π = ∑_{yx_{1:m}} (r_1 + ... + r_m) µ(x_m | yx_{<m} y_m) π(y_m | yx_{<m}) ··· µ(x_1 | y_1) π(y_1)
      = ∑_{yx_{1:m}} (r_1 + ... + r_m) µ(x_{1:m} | y_{1:m}) π(y_{1:m} | x_{<m})

Among the optimal policies there is always a deterministic one:

max_π V_µ^π = max_p V_µ^p

Persistent Turing Machines (reference PTMs)

- an independently introduced model of interactive computation, based on non-deterministic monotone Turing machines (three tapes: read-only, work, write-only)
- cuts the environment out of the loop: inputs are arbitrary
- based on coinductive notions (coalgebras, LTSs, bisimulation)
- stresses infinite input / output alphabets (e.g. strings)

PTM operation: (diagram in the original slides)

The Universal Algorithmic Agent AIXI

Replace the unknown true prior µ^AI in the AIµ model by a universal prior (semi-probability) M^AI with M(q) := 2^{-l(q)}:

M(yx_{1:k}) = ∑_{q: q(y_{1:k}) = x_{1:k}} 2^{-l(q)}

(q knows the length of the output from the input length.) Equivalence of the functional and iterative model still holds; equivalence with the recursive AI model holds after normalization of M (which will no longer be enumerable, but the universal value V_{km}^{pξ} will still be enumerable).

Summing over all enumerable chronological semimeasures:

ξ(yx_{1:n}) := ∑_ρ 2^{-K(ρ)} ρ(yx_{1:n})

∑_{k=1}^n E[∑_{x_k} (√(µ(x_k | yx_{<k} y_k)) − √(ξ(x_k | yx_{<k} y_k)))²] ≤ ln 2 · K(µ)

Just like in the sequence prediction case,

ξ(x_{k:m_k} | yx_{<k} y_{k:m_k}) → µ(x_{k:m_k} | yx_{<k} y_{k:m_k})  for k → ∞,

i.m.s. if h_k ≤ h_max < ∞, and i.m. for general m_k.

Intelligence Order Relation. Extend the ξ-expected reward definition to programs p that are not consistent with the current history:

V_{km}^{pξ}(ẏẋ_{<k}) := (1/N) ∑_{q: q(ẏ_{<k}) = ẋ_{<k}} 2^{-l(q)} V_{km}^{p̃q}

where N is the normalization factor (only necessary for the expectation interpretation) and p̃ is p modified to output ẏẋ_{<k} unaltered in the first k−1 cycles.

p is more or equally intelligent than p′, written p ⪰ p′, iff

∀k ∀ẏẋ_{<k}: V_{k m_k}^{pξ}(ẏẋ_{<k}) ≥ V_{k m_k}^{p′ξ}(ẏẋ_{<k})

For completely unknown µ we could take ξ = M and treat AIXI as optimal by construction (similarly to taking a uniform prior over parameters in the bandits problem).

Separability concepts.

Self-optimizing policies: (1/m) V_{1m}^{p_best µ} → (1/m) V_{1m}^{p^µ µ} for a single p_best independent of µ; alternatively, V_{1m}^{p_best µ} ≥ V_{1m}^{pµ} − o(m) for all µ and p.

The Heaven&Hell example: no self-optimizing policy exists (one wrong initial action can be fatal).

The OnlyOne example:

M := {µ_{y°}: y° ∈ Y, K(y°) = ⌊log₂ |Y|⌋},  where µ_{y°} gives reward r_k = δ_{y_k y°}.

There are N = |Y| such y°. The number of errors is

E_{p_best} ≥ N − 1 = |Y| − 1 = 2^{K(y°)} − 1 ≈ 2^{K(µ)}

2^{K(µ)} is the best possible bound depending on K(µ); it could be OK if K(µ | ẋ_{<k}) = O(1).

µ is passive if the environment is not influenced by the agent's output. µ ∈ M is pseudo-passive if the corresponding p_best = p^ξ is self-optimizing.

The µ-expected number of suboptimal choices:

D_{nµξ} := E[∑_{k=1}^n (1 − δ_{ẏ_k^µ ẏ_k^ξ})],   where ẏ_k^µ ≡ p^µ(ẏẋ_{<k}) and ẏ_k^ξ is the AIξ action.

µ can be asymptotically learned if D_{nµξ}/n → 0, i.e. D_{nµξ} = o(n). Claim: AIXI can asymptotically learn any relevant problem.

µ is uniform if

µ(x_k | yx_{<k} y_k) / ξ(x_k | yx_{<k} y_k) ≤ c · µ(x_k′ | yx_{<k} y_k) / ξ(x_k′ | yx_{<k} y_k)  for all x_k, x_k′.

There are relevant µ that are not uniform. Uniform µ can be asymptotically learned for appropriately weighted D_{nµξ} and bounded horizon.

µ is forgetful if µ(x_k | yx_{<k} y_k) becomes independent of yx_{<l} for fixed l and k → ∞. µ is farsighted if lim_{m_k→∞} ẏ_k^{(m_k)} exists. Further properties: Markovian, generalized (l-th order) Markovian, ergodic, factorizable.

Value-Related Optimality, Discounted Future Value Function

The γ-discounted weighted-average future value of a probabilistic policy π in environment ρ, given history yx_{<k} (the ρ-value of π given yx_{<k}):

V_{kγ}^{πρ}(yx_{<k}) := lim_{m→∞} (1/Γ_k) ∑_{yx_{k:m}} (γ_k r_k + ... + γ_m r_m) ρ(x_{k:m} | yx_{<k} y_{k:m}) π(y_{k:m} | yx_{<k} x_{k:m})

with Γ_k := ∑_{i=k}^∞ γ_i. The discounted AIρ model is defined as the policy

p^ρ := argmax_π V_{kγ}^{πρ},   V_{kγ}^{*ρ} := V_{kγ}^{p^ρ ρ} = max_π V_{kγ}^{πρ}

Linearity and convexity of V^ρ in ρ:

V_{kγ}^{πξ} = ∑_{ν∈M} w_ν^k V_{kγ}^{πν}   and   V_{kγ}^{*ξ} ≤ ∑_{ν∈M} w_ν^k V_{kγ}^{*ν}

where ξ(x_{k:m} | yx_{<k} y_{k:m}) = ∑_{ν∈M} w_ν^k ν(x_{k:m} | yx_{<k} y_{k:m}) with w_ν^k := w_ν ν(yx_{<k}) / ξ(yx_{<k}).

Pareto optimality: there is no other policy π with V_{kγ}^{πν} ≥ V_{kγ}^{p^ξ ν} for all ν ∈ M and strict inequality for at least one ν.

Balanced Pareto optimality: with Δ_ν^k := V_{kγ}^{*ν} − V_{kγ}^{p^ξ ν} ≥ 0 and Δ^k := ∑_{ν∈M} w_ν^k Δ_ν^k, one has

0 ≤ Δ_ν^k ≤ (w_ν^k)^{-1} Δ^k

where all quantities depend on the history yx_{<k}.

If there exists a sequence of self-optimizing policies π^k, then the universal policy p^ξ is self-optimizing:

∃π^k ∀ν: V_{kγ}^{π^k ν} → V_{kγ}^{*ν} w.ν.p.1  ⟹  V_{kγ}^{p^ξ µ} → V_{kγ}^{*µ} w.µ.p.1

where the probabilities are conditional on the historic perceptions x_{<k}.

The values V_{kγ}^{πµ} and V_{kγ}^{*µ} are continuous in µ, and V_{kγ}^{p^µ̂ µ} is continuous in µ̂ at µ̂ = µ: if

∑_{x_k} |µ̂(x_k | yx_{<k} y_k) − µ(x_k | yx_{<k} y_k)| ≤ ε  for all k ≥ k_0 and yx_{<k}

then

i.  |V_{kγ}^{πµ} − V_{kγ}^{πµ̂}| ≤ δ(ε)
ii. |V_{kγ}^{*µ} − V_{kγ}^{*µ̂}| ≤ δ(ε)
iii. V_{kγ}^{*µ} − V_{kγ}^{p^µ̂ µ} ≤ 2 δ(ε)

for all k ≥ k_0 and yx_{<k}, where δ(ε) = r_max min_{n≥k} {(n−k)ε + Γ_n/Γ_k} → 0 for ε → 0.

If y_{1:m} = p(x_{1:m}) (on-policy) and V_k = V_k(yx_{<k}^p), then the universal undiscounted future value V_{k m_k}^{pξ} with bounded dynamic horizon h_k = m_k − k + 1 converges i.m.s. to the true value V_{k m_k}^{pµ}, and the discounted future value V_{kγ}^{pξ} converges i.m. to V_{kγ}^{pµ} for any summable discount sequence γ_k.

i.  |V_{km}^{pξ} − V_{km}^{pµ}| ≤ (m − k + 1) r_max a_{k:m},   |V_{kγ}^{pξ} − V_{kγ}^{pµ}| ≤ r_max √(2 d_{k:∞})
ii. ∑_{k=1}^∞ E[(V_{k m_k}^{pξ} − V_{k m_k}^{pµ})²] ≤ 2 h_max³ r_max² D_∞,   E[(V_{kγ}^{pξ} − V_{kγ}^{pµ})²] ≤ 2 r_max² (D_k − D_{k-1}) → 0
iii. V_{k m_k}^{pξ} → V_{k m_k}^{pµ} i.m.s. for k → ∞ if h_max < ∞;   V_{kγ}^{pξ} → V_{kγ}^{pµ} i.m. for any γ

where a_{k:m} := ∑_{x_{k:m}} |µ(x_{k:m} | yx_{<k} y_{k:m}) − ξ(x_{k:m} | yx_{<k} y_{k:m})|,
d_{k:m} := ∑_{x_{k:m}} µ(x_{k:m} | yx_{<k} y_{k:m}) ln(µ(x_{k:m} | yx_{<k} y_{k:m}) / ξ(x_{k:m} | yx_{<k} y_{k:m})),
D_k := d_{1:k} ≤ ln w_µ^{-1} < ∞.

A Markov decision process is ergodic if there exists a policy which visits each state infinitely often with probability 1. There exist self-optimizing policies p^m for the class of ergodic MDPs:

∀ν ∈ M_MDP1: (1/m) V_{1m}^{*ν} − (1/m) V_{1m}^{p^m ν} = O(m^{-1/3})

With an unbounded effective horizon h_k^eff (γ_{k+1}/γ_k → 1):

∃π^k ∀ν ∈ M_MDP1: V_{kγ}^{π^k ν} → V_{kγ}^{*ν} for k → ∞, for any history yx_{<k}.

If M is a countable class of ergodic MDPs and ξ := ∑_{ν∈M} w_ν ν, then AIξ p_ξ^m maximizing V_{1m}^{pξ} and p^ξ maximizing V_{kγ}^{πξ} are self-optimizing:

∀ν ∈ M: (1/m) V_{1m}^{p_ξ^m ν} → (1/m) V_{1m}^{*ν}   and   V_{kγ}^{p^ξ ν} → V_{kγ}^{*ν}  if γ_{k+1}/γ_k → 1.

If M is finite, then the speed of the first convergence is at least O(m^{-1/3}). Ergodic POMDPs, ergodic l-th order MDPs and factorizable environments also allow self-optimizing policies.

Choice of the Horizon.

A fixed (effective) horizon is OK if we know the lifetime of the agent; e.g. if the probability of surviving to the next cycle is (always, independently) γ < 1, we can choose geometric discounting with rate γ.

General discounting introduces effective unbounded horizons. Let r_k → γ_k r_k with γ_k > 0 and r_k ∈ [0, 1]. If Γ_k := ∑_{i=k}^∞ γ_i < ∞, then

V_{kγ}^{pρ} := lim_{m→∞} (1/Γ_k) V_{km}^{pρ}

exists. The β-effective horizon is h_k^β := min{h ≥ 0: Γ_{k+h} ≤ β Γ_k}. Approximating V_{kγ} by the first h_k^β terms introduces an error of at most β r_max. The effective horizon is h_k^eff := h_k^{β=1/2}.

Horizons (γ_k | Γ_k = ∑_{i=k}^∞ γ_i | h_k^β):
- finite: γ_k = 1 for k ≤ m, 0 for k > m | m − k + 1 | (1−β)(m − k + 1)
- geometric: γ_k = γ^k, 0 ≤ γ < 1 | γ^k/(1−γ) | ln β / ln γ
- power: γ_k = k^{-1-ε}, ε > 0 | ≈ ε^{-1} k^{-ε} | ≈ (β^{-1/ε} − 1) k
- harmonic: γ_k = k^{-1} (ln k)^{-1-ε}, ε > 0 | ≈ ε^{-1} (ln k)^{-ε} | ≈ k^{β^{-1/ε}}
- universal: γ_k = 2^{-K(k)} | decreases more slowly than any computable summable sequence | increases faster than any computable function
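The β-effective horizon can also be checked numerically. A small sketch (my own, with made-up parameters and numerically truncated tail sums) for geometric and power discounting:

    # h_k^beta = min{h >= 0: Gamma_{k+h} <= beta * Gamma_k}

    def eff_horizon(gamma, k, beta=0.5, tail=10**5):
        Gamma = lambda j: sum(gamma(i) for i in range(j, j + tail))   # truncated tail sum
        Gk = Gamma(k)
        h = 0
        while Gamma(k + h) > beta * Gk:
            h += 1
        return h

    geometric = lambda g:   (lambda i: g ** i)
    power     = lambda eps: (lambda i: (i + 1) ** (-1 - eps))

    print(eff_horizon(geometric(0.95), k=10))   # ~ ln(beta)/ln(gamma), independent of k
    print(eff_horizon(power(1.0), k=10))        # ~ (beta^(-1/eps) - 1) * k, grows with k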

Infinite horizon: take ẏ_k ∈ lim inf_{m→∞} Y_k(m), where Y_k(m) := {ẏ_k^{(m)}: m_k = m} is the set of actions that are optimal for lifetime m. The limit lim_m V_{km}(yx_{<k}) need not exist. But immortal agents construed this way are lazy: if postponing reward makes it bigger, the agent will never collect any reward.

Belief contamination: for ρ ∈ M, let ξ_α := (1−α)ξ + αρ; then

sup_ρ lim sup_k [V_{kγ}^{*µ} − V_{kγ}^{p^{ξ_α} µ}] ≤ α r_max / ((1−α) w_µ)

and there are examples where equality holds. Thus a belief contamination of magnitude α comparable to w_µ can completely degenerate performance. (???)

(Posterization) It is not true that if w_ν = 2^{-K(ν)}, then w_ν^k ≥ 2^{-K(ν | yx_{<k})} (up to a constant), where ∑_{ν∈M} w_ν^k ν(· | yx_{<k}) := ξ(· | yx_{<k}).

Actions as random variables

Instead of defining ξ as a mixture of environments, we could use a universal distribution over perceptions and actions and then conditionalize on the actions:

ξ_alt^AI(yx_{1:n}) := M(yx_{1:n}) / ∑_{x_{1:n}} M(yx_{1:n})

where M is Solomonoff's prior (we could use ξ_U as well). Open problems:
- Is ξ_alt^AI enumerable?
- Is ξ_alt^AI = ξ^AI?
- Could M(yx_{<k} ȳ_k) be close to the action of p^ξ and/or p^{ξ_alt} for large k, justifying the interpretation that M(yx_{<k} ȳ_k) is the agent's own belief in selecting action y_k?

Uniform mixture of MDPs

Let µ_T ∈ M_MDP be a completely observable MDP with transition matrix T:

µ_T(a_1 s_1 ... a_n s_n) = T_{s_0 s_1}^{a_1} ··· T_{s_{n-1} s_n}^{a_n}

Reward is a function of state: r_k = r(s_k). The Bayes mixture is

ξ(as_{1:n}) := ∫_T w_T µ_T(as_{1:n}) dT

For a uniform prior belief,

ξ(as_n | as_{<n}) = ξ(as_{1:n}) / ξ(as_{<n}) = (N_{s_{n-1} s_n}^{a_n} + 1) / (∑_{s'} N_{s_{n-1} s'}^{a_n} + |S|)

where |S| = |X| is the number of states and N_{s s'}^a is the historical (i.e. in as_{1:n}) number of transitions from s to s' under action a.

Although T is continuous and contains non-ergodic environments, the Bayes-optimal policy p^ξ is self-optimizing for ergodic environments µ_T ∈ M_MDP1. (Intuition: T is compact and non-ergodic environments have measure zero.)

The posterior belief w_T(as_{1:n}) ∝ w_T µ_T(as_{1:n}) is a (complex) distribution over possible T. Most RL algorithms estimate only a single T (e.g. a most likely or an expected T). The policy p^ξ appropriately explores the environment, while popular policies based on E[T] or on Maximum Likelihood lack exploration. Expected transition probability:

E[T_{s s'}^a | as_{1:n}] = ∫ T_{s s'}^a w_T(as_{1:n}) dT = (N_{s s'}^a + 1) / (∑_{s''} N_{s s''}^a + |S|)
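A sketch of this posterior-mean (Laplace / uniform-Dirichlet) transition estimate computed from transition counts; the state set and trajectory below are made up for illustration:

    from collections import defaultdict

    # E[T^a_{s s'} | history] = (N^a_{s s'} + 1) / (sum_{s''} N^a_{s s''} + |S|)

    S = ["s0", "s1", "s2"]
    N = defaultdict(int)                       # counts N[(s, a, s')]

    history = [("s0", "a", "s1"), ("s1", "b", "s0"), ("s0", "a", "s1"),
               ("s1", "b", "s2"), ("s2", "a", "s0"), ("s0", "a", "s2")]
    for s, a, s2 in history:
        N[(s, a, s2)] += 1

    def expected_T(s, a, s2):
        row = sum(N[(s, a, x)] for x in S)
        return (N[(s, a, s2)] + 1) / (row + len(S))

    for s2 in S:
        print("E[T^a_{s0,%s}] =" % s2, round(expected_T("s0", "a", s2), 3))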

Important Environmental Classes

Here ξ = ξ_U = M is Solomonoff's prior, i.e. AIξ = AIXI.

Sequence Prediction

The AIµ model and SPµ = Θ_µ (derived for binary alphabet B) for known µ are equivalent, and the expected prediction error relates to the value function: V_{1m}^µ = m − E_m^{Θ_µ}.

The general ξ is not symmetric under y_i r_i → (1−y_i)(1−r_i), and is thus more difficult. We concentrate on a deterministic computable environment ż = ż_1 ż_2 ... with Km(ż_1 ... ż_n) ≤ Km(ż) < ∞ and horizon m_k = k (greedily maximize the next reward; this is sufficient for SP but does not show the behavior of AIXI for a universal horizon). We have (best proven bound):

E_∞^{AIξ} ≤ 2^{Km(ż) + O(1)} < ∞

The intuitive interpretation is that each wrong prediction eliminates at least one program p of size l(p) ≤ Km(ż) + O(1). The best possible bound is:

E_∞^{SPξ} ≤ 2 ln 2 · Km(ż) + O(1)

Strategic Games

We restrict ourselves to deterministic strictly competitive strategic games. Assume a bounded-length game padded to length n, and assume the environment uses the minimax strategy:

ȯ_k = argmin_{o_k} max_{y_{k+1}} min_{o_{k+1}} ··· max_{y_n} min_{o_n} V(ẏ_1 ȯ_1 ... ẏ_k o_k y_{k+1} o_{k+1} ... y_n o_n)

with r_1 = ... = r_{n-1} = 0 and r_n = 1 if the AIµ agent wins, 1/2 for a draw, and 0 if the environment wins. An illegal move is an instant loss. Then

ẏ_k^{AI} = argmax_{y_k} ∑_{o_k} ··· max_{y_n} ∑_{o_n} V(ẏȯ_{<k} yo_{k:n}) µ^{SG}(ō_{k:n} | ẏȯ_{<k} y_{k:n})
         = argmax_{y_k} min_{o_k} ··· max_{y_n} min_{o_n} V(ẏȯ_{<k} yo_{k:n}) = ẏ_k^{SG}

i.e. AIµ itself plays minimax. If the game is played multiple times, then µ is factorizable. But if the game has variable length, then µ is no longer factorizable: a better player prefers short games and prefers a quick draw over too long a win.
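A minimal minimax sketch on a tiny explicit game tree (the tree and payoffs are invented; it only illustrates the max/min alternation above, not the AIµ mixture):

    # Agent (max) and environment (min) alternate; leaves hold V in {0, 1/2, 1}.

    def minimax(node, maximizing):
        if not isinstance(node, list):         # leaf: game value
            return node
        vals = [minimax(child, not maximizing) for child in node]
        return max(vals) if maximizing else min(vals)

    # depth-2 game: the agent moves first, then the environment replies
    game = [
        [1.0, 0.5],        # after agent move 0 the environment can hold us to 1/2
        [0.5, 0.0],        # after agent move 1 it can force a loss
    ]
    values = [minimax(child, maximizing=False) for child in game]
    best_move = max(range(len(game)), key=lambda i: values[i])
    print("move values:", values, "-> agent plays", best_move)   # [0.5, 0.0] -> 0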

Using the AIξ model for game playing. The AIξ agent extracts only very few bits of information from a single game, so it needs at least on the order of O(K(game)) games. Variable-length games are better: the AIξ agent will quickly learn the legal moves from short initial games. Next, AIξ will learn the losing positions. Then, AIξ will win some games by luck or will exploit the game's symmetry to learn winning positions.

The AIξ agent can take advantage of environmental players with limited rationality by not settling on the minimax strategy.

Function Minimization

We will consider distributions over functions:

µ^FM(y_1 z_1 ... y_n z_n) := ∑_{f: f(y_i) = z_i, 1≤i≤n} µ(f)

A greedy model is not appropriate, because it will stick with an argument whose value is already below the expectation for the other arguments. For episodes of length m:

ẏ_k = argmin_{y_k} ∑_{z_k} ··· min_{y_m} ∑_{z_m} (α_1 z_1 + ... + α_m z_m) µ(ẏ_1 ż_1 ... ẏ_{k-1} ż_{k-1} y_k z_k ... y_m z_m)

If we want only the last output to be optimal, set α_1 = ... = α_{m-1} = 0, α_m = 1 (FMFξ); if we want already good approximations along the way, set α_1 = ... = α_m = 1 (FMSξ); etc.

For the FMξ model:

ξ^FM(y_1 z_1 ... y_n z_n) := ∑_{q: q(y_i) = z_i, 1≤i≤n} 2^{-l(q)}

FMξ will never cease searching for minima and will test an infinite set of y's for m → ∞. FMFξ will never repeat any y except at t = m. FMSξ will test a new y_t (for fixed m) only if the expected f(y_t) is not too large.

For AIµ/ξ, we need r_k = −α_k z_k and o_k = z_k.

AIξ has a problem with the FMF model: it must first learn that it has to minimize a function. It can learn this by repeated minimization of (different) functions.

Supervised Learning from Examples (EX)

The environment presents inputs o_{k-1} = z_k v_k ≡ (z_k, v_k) ∈ Z × (Y ∪ {?}) ⊆ O. The relations R ⊆ Z × Y might be distributed with probability σ(R):

µ^AI(y_1 x_1 ... y_n x_n) = ∑_{R: r(z_i, y_i) = r_i for 1 < i ≤ n} µ_R(o_1 ... o_n) σ(R)

where x_i = r_i o_i and o_{i-1} = z_i v_i with v_i ∈ Y ∪ {?}.

The AIξ agent only needs O(1) bits from the reinforcement r_k to learn to extract z_i from o_{i-1} and to return y_i with (z_i, y_i) ∈ R.

AIXItl and Optimal Search

The Fastest and Shortest Algorithm for All Problems

Let p*: a given algorithm or a specification of a function; p: any program provably the same as p* with time complexity provably in t_p; time_p(x): the time needed to compute p(x).

For fixed ε ∈ (0, 1/2), the algorithm M_{p*}^ε computes p*(x) in time

time_{M_{p*}^ε}(x) ≤ (1 + ε) t_p(x) + d_p · time_{t_p}(x) + c_p

with constants c_p and d_p independent of x. Caveats: time_{t_p}(x) ≥ t_p(x) in general, the constants are huge, and no provably good time bound t_p may exist at all. This could be the case for many complex approximation problems and for universal reinforcement learning.

For example, if a matrix multiplication algorithm p with provable time bound t_p = d · n^{2+o(1)} exists, then M_{p*}^ε multiplies n×n matrices in time (1 + ε) d n^{2+o(1)} for all x.

Blum's speed-up theorem doesn't affect this, because its speed-up sequence is not computable.

Levin Search

Inverts a quickly computable function g, devoting a 2^{-l(p)} portion of the computation time to each potential inverse-computing program p. Computation time: 2^{l(p)} · (time_p(x) + time for checking g(p(x)) = x).

It can be implemented as Li & Vitanyi's SEARCH(g): in phase i = 1, 2, 3, ..., run every p shorter than i for 2^i 2^{-l(p)} steps, until g is inverted on x.
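A structural sketch of this phase scheme in Python, with the universal machine replaced by a toy interpreter (the "machine", the function g and all names are illustrative assumptions, not the actual SEARCH construction):

    from itertools import product

    def g(y):                       # quickly computable function to invert
        return y * y

    def run(p, steps):
        """Toy machine: p halts after l(p) steps and outputs the integer it encodes."""
        if steps < len(p):
            return None             # not finished within the allotted budget
        return int("1" + "".join(p), 2)   # leading 1 so every bitstring is distinct

    def levin_search(x, max_phase=20):
        for i in range(1, max_phase + 1):              # phase i
            for l in range(1, i):
                for p in product("01", repeat=l):      # all programs of length l < i
                    y = run(p, 2 ** (i - l))           # time share ~ 2^(-l(p))
                    if y is not None and g(y) == x:
                        return y, i
        return None

    print(levin_search(49 * 49))    # finds y = 49 in an early phase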

The Fast Algorithm M_{p*}^ε

Initialize the shared variables L := {}, t_fast := ∞, p_fast := p*. Start algorithms A, B, C in parallel with relative computational resources ε, ε, 1 − 2ε respectively.

Algorithm A. Systematically search the space of proofs for proofs of formulas of the form

∀y: u(?p, y) = u(p*, y) ∧ u(?t, y) ≥ time_{?p}(y)

For each such proof, add the answer substitution (p, t) to L.

Algorithm B. For all (p, t) ∈ L: run U on (t, x) in parallel for all t with relative computational resources 2^{-l(p)-l(t)}. If U halts for some t and U(t, x) < t_fast, then set t_fast := U(t, x) and p_fast := p and restart algorithm C.

Algorithm C. Run U on (p_fast, x). For each executed step decrease t_fast by 1. If U halts, then abort the computations of A and B and return U(p_fast, x).

Time-Bounded AIXI Model

Time-Limited Probability Distributions. We could limit the environments in the universal mixture ξ to those of length ≤ l̃ computable in time ≤ t̃, arriving at an expectimax algorithm with per-cycle complexity t(ẏ_k^{AIξ t̃ l̃}) = O(|Y|^{h_k} |X|^{h_k} 2^{l̃} t̃). (Considered poor / unintelligent.)

The Best Vote Algorithm. Without normalization, the ξ-expected future reward is enumerable:

V_{km}^{pξ}(ẏẋ_{<k}) := ∑_{q∈Q_k} 2^{-l(q)} V_{km}^{pq},   V_{km}^{pq} := r(x_k^{pq}) + ... + r(x_m^{pq})

At every cycle we select the best (possibly inconsistent with history) policy. But V_{k m_k}^{*} is uncomputable (approximable, with the same effort as computing ẏ_k^{AIξ}), so let the policy estimate it (by w_k^p):

p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p

No policy is allowed to claim to be better than it is (valid approximation):

VA(p) ≡ [∀k ∀w_1^p y_1^p ẏ_1 ẋ_1 ... w_k^p y_k^p: p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p ⟹ w_k^p ≤ V_{k m_k}^{pξ}(ẏẋ_{<k})]

V_{k m_k}^{*ξ} is enumerable: there is a sequence (p^i) with VA(p^i) and lim_i w_k^{p^i} = V_{k m_k}^{*ξ}. The convergence is not uniform in k, but that is OK: we select a policy in each step.

The Universal Time-Bounded AIXItl Agent.

1. Systematically search the space of proofs shorter than l_P for proofs of VA(?p) and collect all answer substitutions.
2. Eliminate all p of length > l̃.
3. Modify each p to stop within t̃ time steps (aborting with w_k = 0 if needed).
4. Start the first cycle: k := 1.
5. Run every p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p, where all outputs are redirected to some auxiliary tape; do this incrementally, by adding ẏẋ_{k-1} to the input tape and continuing the computation of the previous cycle.
6. Select p_k := argmax_p w_k^p.
7. Write ẏ_k := y_k^{p_k} to the output tape.
8. Receive input ẋ_k from the environment.
9. Begin the next cycle: k := k + 1, goto step 5.

[Compare with e.g. accuracy-based learning classifier systems (XCS) by Wilson, or market/economy RL by Baum & Durdanovic.]

Optimality of AIXItl. Let p be any extended chronological (incremental) program (like above) of length l(p) ≤ l̃ and computation time per cycle t(p) ≤ t̃, for which there exists a proof of VA(p) of length ≤ l_P. We call p′ effectively more or equally intelligent than p, p′ ⪰_c p, if

∀k ∀ẏẋ_{<k} ∀w_{1:k} ∃w′_{1:k}: p(ẏẋ_{<k}) = w_1 ... w_k ∧ p′(ẏẋ_{<k}) = w′_1 ... w′_k ∧ w′_k ≥ w_k

Then p* ⪰_c p. The length of p* is l(p*) = O(log(l̃ · t̃ · l_P)), the setup time is (at most, depending on the proof search technique) t_setup(p*) = O(l_P² 2^{l_P}), and the computation time per cycle is t_cycle(p*) = O(2^{l̃} t̃). (To go faster, we could eliminate provably poor policies: how many Pareto-good policies are there?)

Caveats: there could be policies which produce good outputs within reasonable time, but whose justification w^p or proof of VA(p) takes unreasonably long. The inconsistent programs must be able to continue strategies started by other policies. A policy can steer the environment in a direction for which it is specialized; recovering from this requires enough separability.

Since AIXI is incomputable but assumes computable environments, it cannot gamble with other AIXIs. Are there interesting environmental classes for which AIξ ∈ M or AIξtl ∈ M?

Algorithm / Properties (time efficient, data efficient, exploration, convergence, global optimum, generalization, POMDP, learning, active):
- Value/Policy iteration: yes/no, yes, YES, YES, NO, NO, NO, yes
- TD with finite S: yes/no, NO, NO, YES, YES, NO, NO, YES, YES
- TD with linear func. approx.: yes/no, NO, NO, yes, yes/no, YES, NO, YES, YES
- TD with general func. approx.: no/yes, NO, NO, no/yes, NO, YES, NO, YES, YES
- Direct Policy Search: no/yes, YES, NO, no/yes, NO, YES, no, YES, YES
- Logic Planners: yes/no, YES, yes, YES, YES, no, no, YES, yes
- RL with Split Trees: yes, YES, no, YES, NO, yes, YES, YES, YES
- Pred. w. Expert Advice: yes/no, YES, YES, yes/no, yes, NO, YES, NO
- Adaptive Levin Search: no/yes, no, no, yes, yes/no, yes, YES, YES, YES
- OOPS: yes/no, no, yes, yes/no, YES, YES, YES, YES
- Market/Economy RL: yes/no, no, NO, no, yes/no, yes, yes/no, YES, YES
- SPXI: no, YES, YES, YES, YES, NO, YES, NO
- AIXI: NO, YES, YES, yes, YES, YES, YES, YES, YES
- AIXItl: no/yes, YES, YES, YES, yes, YES, YES, YES, YES
- Human: yes, yes, yes, no/yes, NO, YES, YES, YES, YES

AIXI can incorporate prior knowledge D: just prepend it (in any encoding) to the inputs; this decreases K(µ) to K(µ|D).

Speed Prior and OOPS (reference TheNewAI)

Speed Prior

Assumption. The environment is deterministic.

Postulate. The cumulative prior probability measure of all x incomputable within time t by any method is at most inversely proportional to t.

Algorithm. Set t := 1. Start a universal TM with empty input tape. Repeat:
- While the number of instructions executed so far exceeds t: toss an unbiased coin; if heads up, set t := 2t, otherwise exit.
- If the input cell contains a symbol, execute it; otherwise set the cell's symbol randomly and set t := t/2.
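A sketch of this sampling procedure in Python, with the universal TM replaced by a stub that writes one random symbol per instruction; it only illustrates the budget-doubling schedule, not a real speed-prior sampler:

    import random
    random.seed(42)

    def sample_output(max_instructions=10**4):
        t = 1.0
        executed = 0
        tape = []                       # symbols produced so far (stub for the TM)
        while executed < max_instructions:
            # while the executed-instruction count exceeds t: coin toss to double t
            while executed > t:
                if random.random() < 0.5:
                    t *= 2              # heads: double the time budget
                else:
                    return tape         # tails: exit with the output produced so far
            # "execute" one instruction: extend the tape with a random symbol
            tape.append(random.randint(0, 1))
            t /= 2                      # halve t after setting a random symbol
            executed += 1
        return tape

    lengths = [len(sample_output()) for _ in range(10000)]
    print("mean output length:", sum(lengths) / len(lengths))   # long outputs are rare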

Optimal Ordered Problem Solver

A searcher is n-bias-optimal (n ≥ 1) if, for any maximal total search time T_max > 0, it is guaranteed to solve any problem r ∈ R that has a solution p ∈ C which can be created and tested in time t(p, r) ≤ P(p|r) T_max / n (P is the task-specific bias).

Basic ingredients of OOPS:

- Primitives. Interruptible low- or high-level instruction tokens (e.g. theorem provers, matrix operators for neural nets, etc.).
- Task-specific prefix codes. Token sequences / program prefixes in a domain-specific language. Instructions can transfer control to previously selected tokens (loops / calls). A prefix is elongated only on the program's explicit request. A prefix may be a complete program for some task (programs are prefix-free w.r.t. a task), but may request more tokens on another task (incrementally growing self-delimiting programs).
- Access to previous solutions. Let p^n denote a found prefix solving the first n tasks. p^1, ..., p^n are stored or frozen in non-modifiable memory accessible to all tasks (accessible to p^{n+1}), but can be copied into modifiable task-specific memory.
- Initial bias. A task-dependent, user-provided probability distribution on program prefixes.

- Self-computed suffix probabilities. Any executed prefix can assign a probability distribution to its continuations. The distribution is encoded and manipulated in task-specific internal memory.
- Two searches. Run in parallel until p^{n+1} is discovered. The first is exhaustive: it tests all possible prefixes in parallel on all tasks up to n+1. The second is focused: it searches for prefixes starting with p^n and tests them only on task n+1 (such prefixes already solve tasks up to n). When an optimal universal solver is found as some p^{n_0}, at most half of the future run time is wasted by the first search.
- Bias-optimal backtracking. Depth-first search in program space, with backtracking triggered by running over the time limit (prefix probability multiplied by total search time so far). Space is reused.

Example / experiments. An interpreter for a FORTH-like language with recursive functions, loops, arithmetic, bias-shifting instructions, and domain-specific instructions. OOPS was first taught about recursion from samples of the context-free language {1^k 2^k}, k ≤ 30; this took 1/3 of a day (OOPS found a universal solver for all k). Then, by rewriting its search procedure, it learned k-disk Towers of Hanoi within a couple of days.

OOPS-Based Reinforcement Learning. Two OOPS modules:

1. The predictor is first trained to find a better world model.
2. The second module (the control program) then uses the model to search for a future action sequence with better cumulative reward.
3. After the current cycle's time for the control program is finished, we execute the current action of the best control program found in step 2.

OOPS is 8-bias-optimal.

Goedel Machine (reference GoedelMachines)

While executing some initial problem-solving strategy, a Goedel Machine simultaneously runs a proof searcher which systematically and repeatedly tests proof techniques. An unguarded part of the GM, switchprog, can rewrite the whole GM. It is executed only when the GM has found a proof that the rewrite will result in bigger expected reward.

A program that prints itself. There is no problem with part of a program representing the whole program, to any degree of accuracy:

main(){char q=34,n=10,*a="main(){char q=34,n=10,*a=%c%s%c;printf(a,q,a,q,n);}%c";printf(a,q,a,q,n);}

Globally Optimal Self-Changes. Given any formalizable utility function u and assuming consistency of the underlying formal system A, any self-change of p obtained through execution of some switchprog identified through the proof of a target theorem [that running switchprog increases expected reward] is globally optimal: the expected utility of executing the present switchprog is higher than the utility of waiting for the proof searcher to produce an alternative switchprog later.

Proof Techniques

1. get-axiom(n)
   a. Hardware axioms
   b. Reward axioms
   c. Environment axioms
   d. Uncertainty axioms and string manipulation axioms
   e. Initial state axioms
   f. Utility axioms
2. apply-rule(k, m, n)
3. delete-theorem(m)
4. set-switchprog(m, n)
5. state2theorem(m, n)

The GM hardware can itself be probabilistic; this has to be represented by a probabilistic logic and in expectations about which theorems hold.

Possible Types of Goedel Machine Self-improvements

1. Just change the ratio of time-sharing between the proof-searching subroutine and the subpolicy e (those parts of p responsible for environment interaction).
2. Modify e only. For example, to conduct some experiments and use the resulting knowledge. (Even if it turns out that it would have been better to stick with the previous routine, the expectation of reward can favor experimentation.)
3. Modify the axioms to speed up theorem proving.
4. Modify the utility function and target theorem, so that the new values are better according to the current target theorem.
5. Modify the probability distribution on proof techniques, etc.
6. Do promptly a very limited rewrite to meet some deadline.
7. In certain uninteresting environments, trash almost all of the GM and leave a looping call to a pleasure-center-activating function.
8. Take actions in the external environment to augment the machine's hardware.