1 Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Adapted by Łukasz Stafiniak from the book by Marcus Hutter.
2 Table of contents
- Universal Sequence Prediction
  - Epicurus, Hume, Ockham, Bayes, Solomonoff
  - Algorithmic Information Theory and Probability
  - Convergence, Error and Loss Bounds
  - Games of Chance
  - Optimality
- The Universal Algorithmic Agent AIXI
  - Agents in Known Probabilistic Environments
  - The AIµ model in Functional Form
  - The AIµ model in Recursive and Iterative Form
  - Factorizable environments µ
  - Probabilistic policies
  - Persistent Turing Machines (reference: PTMs)
  - The Universal Algorithmic Agent AIXI
3 - Intelligence Order Relation
  - Separability concepts
  - Value-Related Optimality, Discounted Future Value Function
  - Choice of the Horizon
  - Actions as random variables
  - Uniform mixture of MDPs
  - Important Environmental Classes: Sequence Prediction; Strategic Games (using the AIξ model for game playing); Function Minimization; Supervised Learning from Examples (EX)
- AIXItl and Optimal Search
  - The Fastest and Shortest Algorithm for All Problems
  - Levin Search
  - The Fast Algorithm M_p^ε
  - Time-Bounded AIXI Model: Time-Limited Probability Distributions; The Best Vote Algorithm
4 - The Universal Time-Bounded AIXItl Agent; Optimality of AIXItl
  - A program that prints itself; Proof Techniques
- Speed Prior and OOPS (reference: TheNewAI)
  - Speed Prior
  - Optimal Ordered Problem Solver
- Gödel Machine (reference: GoedelMachines)
  - Possible Types of Gödel Machine Self-improvements
5 Universal Sequence Prediction
Epicurus, Hume, Ockham, Bayes, Solomonoff
- Epicurus' principle of multiple explanations
- Occam's razor (simplicity) principle
- Hume's negation of induction
- Bayes' rule for conditional probabilities
- Solomonoff's universal theory of inductive inference
Induction here = reasoning about the future from past experience. Prequential approach (transductive inference) = predictions without building a model. Every induction problem can be phrased as a sequence prediction task. Classification is a special case of sequence prediction. We are interested in maximizing profit / minimizing loss. Separating noise from data is not necessary.
6 Algorithmic Information Theory and Probability
A prefix code (prefix-free set of strings) satisfies Kraft's inequality: Σ_x 2^{−l(x)} ≤ 1.
Kolmogorov complexity: K(x) := min_p {l(p): U(p) = x}, K(x|y) := min_p {l(p): U(y, p) = x}
Properties:
- K(x) ≤⁺ l(x) + 2 log₂ l(x)
- K(x, y) =⁺ K(x) + K(y | x, K(x))
- K(x) ≤⁺ −log₂ P(x) + K(P) if P: B* → [0, 1] is enumerable and Σ_x P(x) ≤ 1
Bayes' rule: p(H_i | D) = p(D | H_i) p(H_i) / Σ_{i∈I} p(D | H_i) p(H_i)
Kolmogorov complexity is only co-enumerable (= upper semi-computable).
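As an illustration (not from the book), both the Kraft inequality for a concrete prefix-free code and Bayes' rule can be checked numerically; the code, hypotheses and numbers below are made up for the example:

```python
# Kraft inequality for a prefix-free code: sum_x 2^-l(x) <= 1
code = {"a": "0", "b": "10", "c": "110", "d": "111"}  # prefix-free codewords

def is_prefix_free(words):
    return not any(u != v and v.startswith(u) for u in words for v in words)

kraft = sum(2 ** -len(w) for w in code.values())
assert is_prefix_free(code.values()) and kraft <= 1

# Bayes' rule: p(H_i | D) = p(D | H_i) p(H_i) / sum_j p(D | H_j) p(H_j)
prior = {"H1": 0.5, "H2": 0.5}
likelihood = {"H1": 0.8, "H2": 0.2}   # p(D | H_i)
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior["H1"])  # 0.8
```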
7 Universal prior: the probability that the output of a universal monotone TM starts with x when provided with fair coin flips on the input tape:
M(x) := Σ_{p: U(p) = x*} 2^{−l(p)}
µ ≥ 0 is a semimeasure if µ(ǫ) ≤ 1 and µ(x) ≥ µ(x0) + µ(x1) (a probability measure if equality holds).
Universality of M: M multiplicatively dominates all enumerable semimeasures:
M(x) ≥ 2^{−K(ρ)} ρ(x), where ρ is any enumerable semimeasure.
M is enumerable but not estimable. Conditioning on a string:
M(y|x) := M(xy) / M(x) ≥ 2^{−K(y|x)}
8 Try to predict the continuation x_n ∈ B of a given sequence x_1 ... x_{n−1}.
Σ_{t=1}^∞ (1 − M(x_t|x_{<t})) ≤ −Σ_{t=1}^∞ ln M(x_t|x_{<t}) = −ln M(x_{1:∞}) ≤ ln 2 · Km(x_{1:∞})
If x_{1:∞} is computable, then Km(x_{1:∞}) < ∞, and M(x_t|x_{<t}) → 1.
Assume now the true sequence is drawn from a computable probability distribution µ. The probability of x_n given x_{<n} is µ(x_n|x_{<n}) = µ(x_{1:n}) / µ(x_{<n}).
Σ_{t=1}^∞ Σ_{x_{<t}∈B^{t−1}} µ(x_{<t}) (M(0|x_{<t}) − µ(0|x_{<t}))² ≤ (ln 2 / 2) K(µ) < ∞
Posterior convergence of M to µ: that is, M(0|x_{<t}) − µ(0|x_{<t}) tends to zero with µ-probability 1. We will see a proof later, by approximating M with
ξ(x) := Σ_{ν∈M} 2^{−K(ν)} ν(x)
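Posterior convergence of a Bayes mixture to the true environment can be illustrated with a small finite class of Bernoulli measures standing in for the class of all enumerable semimeasures (an illustration only, not the actual incomputable M; the class, weights and seed are made up):

```python
import random

random.seed(0)
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]          # finite class M of Bernoulli(theta)
w = {th: 1 / len(thetas) for th in thetas}  # prior weights w_nu
mu = 0.7                                    # true environment, mu in M

for t in range(2000):
    x = 1 if random.random() < mu else 0
    # posterior weights: w_nu(x_1:t) = w_nu(x_<t) * nu(x_t|x_<t) / xi(x_t|x_<t)
    xi = sum(w[th] * (th if x == 1 else 1 - th) for th in thetas)
    for th in thetas:
        w[th] *= (th if x == 1 else 1 - th) / xi

xi_next = sum(w[th] * th for th in thetas)  # xi(1 | x_1:t)
print(xi_next)  # close to mu(1) = 0.7: the posterior concentrates on mu
```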
9 A sequence is µ-Martin-Löf random (µ.M.L.) iff: ∃c ∀n: M(x_{1:n}) ≤ c µ(x_{1:n})
It is µ/ξ-Martin-Löf random (µ.ξ.r.) iff: ∃c ∀n: ξ(x_{1:n}) ≤ c µ(x_{1:n})
A theorem true for all µ-M.L. random sequences is true with µ-probability 1.
Complexity increase:
- K(yx) ≤ K(y) + K(x|y) + O(1) (prefix Kolmogorov complexity)
- C(yx) ≤ C(y) + K(x|y) + O(K(C(y))) (plain Kolmogorov complexity)
- KM(yx) ≤ KM(y) + K(x|y) + O(K(l(y))), where KM(x) := −log₂ M(x)
- KM(yx) ≤ KM(y) + K(µ|y) − log₂ µ(x|y) + O(K(l(y)))
where K ∈ {K, C}.
10 A predictor based on K fails, due to K(x1) =⁺ K(x0). The monotone complexity Km(x) := min_p {l(p): U(p) = x*} does not suffer from this. m(x) := 2^{−Km(x)} is extremely close to M(x).
- m converges on-sequence rapidly: Π_{t=1}^n m(x_t|x_{<t}) = 2^{−Km(x_{1:n})}, so m(x_t|x_{<t}) ≠ 1 at most Km(x_{1:∞}) times.
- m may converge slowly off-sequence: there exist s, U and x_{1:∞} with Km(x_{1:∞}) = s for which the off-sequence predictions stay of order 2^{−s} for a long time.
- m may not converge at all in probabilistic environments: there is µ ∈ M_comp^msr \ M_det with m(x_t|x_{<t}) ↛ µ(x_t|x_{<t}).
- m is not a semimeasure, and normalization does not improve the above.
11 Convergence, Error and Loss Bounds
Assumptions: ξ is a mixture distribution, a w_ν-weighted sum of probability distributions ν from a set M containing the true distribution µ:
ξ(x_t|x_{<t}) = Σ_{ν∈M} w_ν(x_{<t}) ν(x_t|x_{<t}), w_ν(x_{1:t}) := w_ν(x_{<t}) ν(x_t|x_{<t}) / ξ(x_t|x_{<t}), w_ν(ǫ) = w_ν
Distance measures between distributions y and z:
- absolute (or Manhattan): a(y, z) := Σ_i |y_i − z_i|
- quadratic (or squared Euclidean): s(y, z) := Σ_i (y_i − z_i)²
- (squared) Hellinger distance: h(y, z) := Σ_i (√y_i − √z_i)²
- relative entropy or KL divergence: d(y, z) := Σ_i y_i ln(y_i / z_i)
- absolute divergence: b(y, z) := Σ_i y_i |ln(y_i / z_i)|
Entropy inequalities: s ≤ d, h ≤ d, a ≤ √(2d), d ≤ b
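The entropy inequalities can be spot-checked numerically on random distribution pairs (a sanity check on finite samples, not a proof):

```python
import math, random

random.seed(1)

def rand_dist(n):
    v = [random.random() + 1e-9 for _ in range(n)]
    s = sum(v)
    return [x / s for x in v]

for _ in range(1000):
    n = random.randint(2, 6)
    y, z = rand_dist(n), rand_dist(n)
    a = sum(abs(yi - zi) for yi, zi in zip(y, z))                          # absolute
    s = sum((yi - zi) ** 2 for yi, zi in zip(y, z))                        # quadratic
    h = sum((math.sqrt(yi) - math.sqrt(zi)) ** 2 for yi, zi in zip(y, z))  # Hellinger
    d = sum(yi * math.log(yi / zi) for yi, zi in zip(y, z))                # KL
    b = sum(yi * abs(math.log(yi / zi)) for yi, zi in zip(y, z))           # abs. divergence
    eps = 1e-12
    assert s <= d + eps and h <= d + eps
    assert a <= math.sqrt(2 * d) + eps and d <= b + eps
print("all inequalities hold on the sample")
```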
12 Instantaneous (at t) and total distances between ξ and µ:
X = {1, ..., N}, N = |X|, i = x_t, y_i = µ(x_t|x_{<t}), z_i = ξ(x_t|x_{<t})
a_t(x_{<t}) := Σ_{x_t} |µ(x_t|x_{<t}) − ξ(x_t|x_{<t})|, A_n := Σ_{t=1}^n E[a_t(x_{<t})]
s_t(x_{<t}) := Σ_{x_t} (µ(x_t|x_{<t}) − ξ(x_t|x_{<t}))², S_n := Σ_{t=1}^n E[s_t(x_{<t})]
h_t(x_{<t}) := Σ_{x_t} (√(µ(x_t|x_{<t})) − √(ξ(x_t|x_{<t})))², H_n := Σ_{t=1}^n E[h_t(x_{<t})]
d_t(x_{<t}) := Σ_{x_t} µ(x_t|x_{<t}) ln [µ(x_t|x_{<t}) / ξ(x_t|x_{<t})], D_n := Σ_{t=1}^n E[d_t(x_{<t})]
b_t(x_{<t}) := Σ_{x_t} µ(x_t|x_{<t}) |ln [µ(x_t|x_{<t}) / ξ(x_t|x_{<t})]|, B_n := Σ_{t=1}^n E[b_t(x_{<t})]
For example, the first convergence result on the next page says:
Σ_{t=1}^n E[Σ_{x_t} (µ(x_t|x_{<t}) − ξ(x_t|x_{<t}))²] ≤ ln w_µ^{−1}
13 Convergence of ξ to µ:
S_n ≤ D_n ≤ ln w_µ^{−1} < ∞ ⟹ s_t(x_{<t}) ≤ d_t(x_{<t}) → 0 w.µ.p.1, and ξ(x_t|x_{<t}) − µ(x_t|x_{<t}) → 0 w.µ.p.1 (and i.m.s.) for any x_t
Σ_{t=1}^n E[(√(ξ(x_t|x_{<t}) / µ(x_t|x_{<t})) − 1)²] ≤ H_n ≤ D_n ≤ ln w_µ^{−1} < ∞
⟹ √(ξ(x_t|x_{<t}) / µ(x_t|x_{<t})) → 1 i.m.s. and ξ(x_t|x_{<t}) / µ(x_t|x_{<t}) → 1 w.µ.p.1
b_t(x_{<t}) ≥ d_t(x_{<t}), a_t(x_{<t}) ≤ √(2 d_t(x_{<t})); B_n ≥ D_n, A_n ≤ √(2n D_n)
where w_µ is the weight of µ in ξ and x_{1:∞} is an arbitrary (nonrandom) sequence.
µ/ξ-randomness cannot be decided from ξ being a mixture distribution and the dominance property alone. (E.g., for Bernoulli sequences, it is related to denseness of M in Θ.)
14 When µ ∉ M, but there is µ̂ ∈ M with KL divergence
D_n(µ‖µ̂) := Σ_{x_{1:n}} µ(x_{1:n}) ln [µ(x_{1:n}) / µ̂(x_{1:n})] ≤ c, then
D_n = E[ln (µ(x_{1:n}) / ξ(x_{1:n}))] = E[ln (µ̂(x_{1:n}) / ξ(x_{1:n}))] + E[ln (µ(x_{1:n}) / µ̂(x_{1:n}))] ≤ ln w_µ̂^{−1} + c
Error bounds
A prediction scheme Θ_ρ predicts x_t^{Θρ} := argmax_{x_t} ρ(x_t|x_{<t}). The probability of making a wrong prediction and the expected number of errors:
e_t^{Θρ}(x_{<t}) := 1 − µ(x_t^{Θρ}|x_{<t}), E_n^{Θρ} := Σ_{t=1}^n E[e_t^{Θρ}(x_{<t})]
Error bound:
0 ≤ E_n^{Θξ} − E_n^{Θµ} ≤ √(2 (E_n^{Θξ} + E_n^{Θµ}) S_n) ≤ S_n + √(4 E_n^{Θµ} S_n + S_n²) ≤ 2 S_n + 2 √(E_n^{Θµ} S_n)
15 Loss bounds
Let l_{x_t y_t} ∈ R be the loss received when taking action y_t ∈ Y while x_t ∈ X is the t-th symbol of the sequence. W.l.o.g. 0 ≤ l_{x_t y_t} ≤ 1. We call an action a prediction even if X ≠ Y.
A prediction scheme Λ_ρ predicts:
y_t^{Λρ} := argmin_{y_t∈Y} Σ_{x_t} ρ(x_t|x_{<t}) l_{x_t y_t}
The actual and the total µ-expected loss:
l_t^{Λρ}(x_{<t}) := E_t[l_{x_t y_t^{Λρ}}], L_n^{Λρ} := Σ_{t=1}^n E[l_t^{Λρ}(x_{<t})]
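The Λ_ρ action selection and the resulting excess loss of Λ_ξ over the informed Λ_µ can be sketched with a finite Bernoulli class standing in for M (an illustration; the 0/1 loss matrix and parameters are made up):

```python
import random

random.seed(2)
loss = {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}  # l[x][y]: predict y for next bit x

thetas = [0.2, 0.4, 0.6, 0.8]
w = {th: 0.25 for th in thetas}  # prior weights of the mixture xi
mu = 0.6                         # true Bernoulli parameter, mu in M
L_xi = L_mu = 0.0

def best_action(p1):  # Lambda_rho: argmin_y sum_x rho(x) l_xy
    return min((0, 1), key=lambda y: (1 - p1) * loss[0][y] + p1 * loss[1][y])

for t in range(5000):
    xi1 = sum(w[th] * th for th in thetas)        # xi(1 | history)
    y_xi, y_mu = best_action(xi1), best_action(mu)
    x = 1 if random.random() < mu else 0
    L_xi += loss[x][y_xi]
    L_mu += loss[x][y_mu]
    norm = sum(w[th] * (th if x else 1 - th) for th in thetas)
    for th in thetas:                              # Bayes posterior update of w_nu
        w[th] *= (th if x else 1 - th) / norm

print(L_xi - L_mu)  # small total excess loss, as the unit loss bound predicts
```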
16 Unit loss bound:
0 ≤ L_n^{Λξ} − L_n^{Λµ} ≤ D_n + √(4 L_n^{Λµ} D_n + D_n²) ≤ 2 D_n + 2 √(L_n^{Λµ} D_n)
Corollary:
- If L_∞^{Λµ} is finite, then L_∞^{Λξ} is finite.
- L_∞^{Λξ} ≤ 2 D_∞ ≤ 2 ln w_µ^{−1} for deterministic µ if l_{x ẏ} = 0 (the informed scheme suffers no loss).
- L_n^{Λξ} / L_n^{Λµ} = 1 + O((L_n^{Λµ})^{−1/2}) → 1 if L_n^{Λµ} → ∞
- L_n^{Λξ} − L_n^{Λµ} = O(√(L_n^{Λµ}))
Let Λ be any prediction scheme. Then L_n^{Λµ} ≤ L_n^{Λ} and l_t^{Λµ}(x_{<t}) ≤ l_t^{Λ}(x_{<t}), and:
L_n^{Λξ} − L_n^{Λ} ≤ 2 √(L_n^{Λξ} D_n), L_n^{Λξ} / L_n^{Λ} ≤ 1 + O((L_n^{Λ})^{−1/2})
17 l_t^{Λξ} − l_t^{Λµ} → 0 would follow from ξ → µ by continuity, but l_t^{Λξ} is in general discontinuous in ξ. Fortunately, it is continuous at ξ = µ.
Instantaneous loss bound:
Σ_{t=1}^n E[(√(l_t^{Λξ}(x_{<t})) − √(l_t^{Λµ}(x_{<t})))²] ≤ 2 D_n ≤ 2 ln w_µ^{−1} < ∞
0 ≤ l_t^{Λξ}(x_{<t}) − l_t^{Λµ}(x_{<t}) ≤ Σ_{x_t} |ξ(x_t|x_{<t}) − µ(x_t|x_{<t})| ≤ √(2 d_t(x_{<t})) → 0 w.µ.p.1
0 ≤ l_t^{Λξ}(x_{<t}) − l_t^{Λµ}(x_{<t}) ≤ 2 d_t(x_{<t}) + 2 √(l_t^{Λµ}(x_{<t}) d_t(x_{<t})) → 0 w.µ.p.1
The loss function could depend on time and even on the individual history; it is enough that it is bounded: l_{x_t y_t}(x_{<t}) ∈ [l_min, l_max], l_Δ := l_max − l_min.
Global loss bound:
0 ≤ L_n^{Λξ} − L_n^{Λµ} ≤ l_Δ D_n + √(4 (L_n^{Λµ} − n l_min) l_Δ D_n + l_Δ² D_n²)
18 Games of Chance
The profit is p_t = −l_{x_t y_t} ∈ [−p_max, p_max], the total profit is P_n^{Λρ} = −L_n^{Λρ}, and the average per-round profit is p̄_n^{Λξ} := (1/n) P_n^{Λξ}. Time to win:
p̄_n^{Λξ} = p̄_n^{Λµ} − O(n^{−1/2}) → p̄_n^{Λµ}
if p̄_n^{Λµ} > 0, then P_n^{Λξ} > 0 for n > (2 p_max / p̄_n^{Λµ})² k_µ, where w_µ = e^{−k_µ}
Information-theoretic interpretation: we need to receive that many bits of information about µ, paid for (in the worst case) by lost profit. (Read the book.)
19 Optimality
A prior ξ is Pareto optimal w.r.t. s_t, S_n, d_t, D_n, e_t, E_n, l_t, L_n.
Balanced Pareto optimality w.r.t. L: with the regret Δ_ν := L_{nν}^{Λξ} − L_{nν}^{Λν} ≥ 0 of Λ_ξ in environment ν ∈ M, the weighted sum Σ_{ν∈M} w_ν Δ_ν is minimal for ξ; in particular Δ_η ≤ w_η^{−1} max_{λ∈M} Δ_λ.
We have derived bounds for the mean squared sum, S_{nν}^{ξ_w} ≤ ln w_ν^{−1}, and for the loss regret, L_{nν}^{Λξ_w} − L_{nν}^{Λν} ≤ 2 ln w_ν^{−1} + 2 √(L_{nν}^{Λν} ln w_ν^{−1}).
Optimality of universal weights: within the set of enumerable weight functions with short program, the universal weights w_ν = 2^{−K(ν)} lead to loss bounds within an additive constant of the smallest (in ln w_µ^{−1}) in all enumerable environments. It is difficult to prove that universal weights are optimal; see exercise 3.7 in the book.
20 For the maximum a posteriori approximator ρ(x) := max_ν {w_ν ν(x): ν ∈ M}, or equivalently the minimum description length estimator ρ(x) := (λx. argmin_{ν∈M} {log₂ ν(x)^{−1} + log₂ w_ν^{−1}})(x):
Σ_{t=1}^∞ E[Σ_{x_t} (√(µ(x_t|x_{<t})) − √(ρ^{(norm)}(x_t|x_{<t})))²] ≤ w_µ^{−1}
where ρ(x_t|x_{<t}) := ρ(x_{1:t}) / ρ(x_{<t}) and ρ^norm(x_t|x_{<t}) := ρ(x_{1:t}) / Σ_{x_t} ρ(x_{1:t}). These bounds are tight, thus MDL converges i.m.s., but the convergence speed can be exponentially worse than for ξ.
Multistep predictions: for horizon h ≥ 1, (1/2) E[a_{t:n_t}]² ≤ h² ln w_µ^{−1}; for arbitrary horizon, convergence in the mean (slow).
Continuous Probability Classes: continuous entropy bound
D_n := E[ln (µ(x_{1:n}) / ξ(x_{1:n}))] ≤ ln w_µ^{−1} + (d/2) ln (n / 2π) + (1/2) ln det j_n + o(1)
where j_n is the Fisher information matrix for the family of distributions M = {µ_θ: θ ∈ Θ ⊆ R^d} and continuity conditions hold (see the book).
21 The Universal Algorithmic Agent AIXI
Agents in Known Probabilistic Environments
The agent model (deterministic case): an agent p: X* → Y*, y_{1:k} = p(x_{<k}), interacts with an environment q: Y* → X*, x_{1:k} = q(y_{1:k}); the perception x_k = r_k o_k ∈ X = R × O consists of a reward and an observation. Action y_k is determined by a policy p depending on the I/O history yx_{<k} = y_1 x_1 ... y_{k−1} x_{k−1}.
The future total reward the agent receives in cycles k to m:
V_{km}^{pq} := Σ_{i=k}^m r(x_i^{pq})
22 General case: probabilistic environment µ. The best agent maximizes the µ-expected utility V^{pµ} (called the value function):
V_{1m}^{pµ} := Σ_q µ(q) V_{1m}^{pq}, p* := argmax_p V_{1m}^{pµ}
The AIµ model in Functional Form
The AIµ model is the agent with the policy p^µ that maximizes the µ-expected total reward r_1 + ... + r_m, i.e. p^µ := argmax_p V_{1m}^{pµ}. In cycle k the (future) value V_{km}^{pµ}(ẏẋ_{<k}) of policy p is defined as the µ-expectation of the future reward sum r_k + ... + r_m, conditioned on the true (generating) history ẏẋ_{<k}. Let Q_k := {q: q(ẏ_{<k}) = ẋ_{<k}} be the set of all environments consistent with this history. Then:
V_{km}^{pµ}(ẏẋ_{<k}) := Σ_{q∈Q_k} µ(q) V_{km}^{pq} / Σ_{q∈Q_k} µ(q)
23 We generalize the finite lifetime m to a dynamic farsightedness h_k ≡ m_k − k + 1 ≥ 1, called the horizon:
p_k := argmax_{p∈P_k} V_{k m_k}^{pµ}(ẏẋ_{<k})
where P_k := {p: ∃y_k: p(ẋ_{<k}) = ẏ_{<k} y_k} is the set of policies consistent with the current history. The actual policy (the AIµ model) arises by inserting p_k, p_{k−1}, ..., p_1 recursively: in cycle k the agent outputs the action of p_k on the history produced by p_1, ..., p_{k−1}.
For constant m we have:
V_{km}^µ(ẏẋ_{<k}) ≥ V_{km}^{pµ}(ẏẋ_{<k}) ∀p ∈ P_k
(In sequence prediction it was enough to maximize the next reward; here the sum of future rewards is important.)
24 The AIµ model in Recursive and Iterative Form
Notation: underlined arguments represent probabilistic variables, non-underlined variables represent conditions:
ρ(x_{<n} x̲_n) = ρ(x̲_{1:n}) / ρ(x̲_{<n})
Chronological probability distributions satisfy ρ(yx̲_{<k} y_k) := Σ_{x_k} ρ(yx̲_{1:k}) (actions enter as conditions only).
Expected reward:
V_{km}^µ(yx_{<k} y_k) := Σ_{x_k} [r(x_k) + V_{k+1,m}^µ(yx_{1:k})] µ(yx_{<k} y x̲_k)
How p^µ chooses y_k: V_{km}^µ(yx_{<k}) := max_{y_k} V_{km}^µ(yx_{<k} y_k). Together with the induction start V_{m+1,m}^µ(yx_{1:m}) := 0, V_{km}^µ is completely defined:
V_{km}^µ(yx_{<k}) = max_{y_k} Σ_{x_k} [r(x_k) + V_{k+1,m}^µ(yx_{1:k})] µ(yx_{<k} y x̲_k)
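The recursion above is an expectimax computation; a direct implementation for a hypothetical toy known environment µ (two actions, two perceptions, reward r(x) = x, the environment echoes the last action with probability 0.8 — all made up for the sketch):

```python
Y, X = (0, 1), (0, 1)  # actions and perceptions; reward r(x) = x
m = 4                  # fixed lifetime

def V(hist):
    # hist: tuple of (y, x) pairs; returns V^mu_{k m} via the expectimax recursion
    k = len(hist) + 1
    if k > m:                       # induction start: V_{m+1, m} = 0
        return 0.0
    best = float("-inf")
    for y in Y:                     # max over actions y_k
        exp = 0.0
        for x in X:                 # expectation over perceptions x_k
            p = 0.8 if x == y else 0.2  # toy mu(x_k | ..., y_k): echoes the action
            exp += p * (x + V(hist + ((y, x),)))
        best = max(best, exp)
    return best

print(V(()))  # optimal expected total reward over m = 4 cycles: 4 * 0.8 = 3.2
```

Always playing y = 1 yields expected reward 0.8 per cycle, so the recursion returns 3.2.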
25 If m_k is the horizon function and ẏẋ_{<k} is the actual history in cycle k, then
ẏ_k = argmax_{y_k} V_{k m_k}^µ(ẏẋ_{<k} y_k)
Unfolding the recursion:
ẏ_k := argmax_{y_k} Σ_{x_k} max_{y_{k+1}} Σ_{x_{k+1}} ... max_{y_{m_k}} Σ_{x_{m_k}} (r(x_k) + ... + r(x_{m_k})) µ(ẏẋ_{<k} y x̲_{k:m_k})
The value of a general policy p:
V_{km}^{pµ}(yx_{<k}) := Σ_{x_{k:m}} (r_k + ... + r_m) µ(yx_{<k} y x̲_{k:m}) with y_{1:m} = p(x_{<m})
Equivalence of the Functional and Explicit AI Model:
µ(y x̲_{1:k}) = Σ_{q: q(y_{1:k}) = x_{1:k}} µ(q)
26 Factorizable environments µ. Assume that the cycles are grouped into independent episodes:
µ(y x̲_{1:n}) = Π_{r=0}^{s−1} µ_r(y x̲_{n_r+1:n_{r+1}})
Then ẏ_k depends on µ_r and on the x and y of episode r only:
ẏ_k = argmax_{y_k} Σ_{x_k} ... max_{y_t} Σ_{x_t} (r(x_k) + ... + r(x_t)) µ_r(ẏẋ_{n_r+1:k−1} y x̲_{k:t})
with t := min {m_k, n_{r+1}}.
27 Probabilistic policies. For a policy π:
V^{πµ} = Σ_{yx_{1:m}} (r_1 + ... + r_m) µ(x̲_m | yx_{<m} y_m) π(y̲_m | yx_{<m}) ... µ(x̲_1 | y_1) π(y̲_1)
= Σ_{yx_{1:m}} (r_1 + ... + r_m) µ(x̲_{1:m} | y_{1:m}) π(y̲_{1:m} | x_{<m})
Among the optimal policies there is always a deterministic one:
max_π V^{πµ} = max_p V^{pµ}
28 Persistent Turing Machines (reference: PTMs)
- an independently introduced model of interactive computation based on non-monotonic deterministic Turing machines (three tapes: read-only, work, write-only)
- cuts the environment out of the loop; inputs are arbitrary
- based on coinductive notions (coalgebras, LTSs, bisimulation)
- stresses infinite input/output alphabets (e.g. strings)
PTM operation: (figure omitted)
29 The Universal Algorithmic Agent AIXI
Replace the unknown true prior probability µ^AI in the AIµ model by a universal semi-probability M^AI with M(q) := 2^{−l(q)} (q knows the length of the output from the input length):
M(y x̲_{1:k}) = Σ_{q: q(y_{1:k}) = x_{1:k}} 2^{−l(q)}
The equivalence of the functional and iterative model still holds; the equivalence with the recursive AI model holds after normalization of M (which will no longer be enumerable, but the universal value V_{km}^{pξ} will still be enumerable).
Summing over all enumerable chronological semimeasures, ξ(y x̲_{1:n}) := Σ_ρ 2^{−K(ρ)} ρ(y x̲_{1:n}), gives
Σ_{k=1}^n Σ_{x_{1:k}} µ(yx_{<k}) (√(µ(yx_{<k} y x̲_k)) − √(ξ(yx_{<k} y x̲_k)))² ≤ ln 2 · K(µ)
Just like in the sequence prediction case, ξ(yx_{<k} y x̲_{k:m_k}) → µ(yx_{<k} y x̲_{k:m_k}) for k → ∞: i.m.s. if h_k ≡ m_k − k + 1 ≤ h_max < ∞, and i.m. for general m_k.
30 Intelligence Order Relation. Extend the ξ-expected reward definition to programs p that are not consistent with the current history:
V_{km}^{pξ}(ẏẋ_{<k}) := (1/N) Σ_{q: q(ẏ_{<k}) = ẋ_{<k}} 2^{−l(q)} V_{km}^{p̃q}
where N is the normalization factor, only necessary for the expectation interpretation, and p̃ is p modified to output ẏẋ_{<k} unaltered and behave as p in further cycles.
p is more or equally intelligent than p′ (p ⪰ p′) iff:
∀k ∀ẏẋ_{<k}: V_{k m_k}^{pξ}(ẏẋ_{<k}) ≥ V_{k m_k}^{p′ξ}(ẏẋ_{<k})
For completely unknown µ we could take ξ = M and treat AIXI as optimal by construction (similarly to taking a uniform prior over parameters for the bandits problem).
31 Separability concepts.
Self-optimizing policies: (1/m) V_{1m}^{p_m µ} → (1/m) V_{1m}^µ for a sequence of policies p_m independent of µ; alternatively, V_{1m}^{p_best µ} ≥ V_{1m}^{pµ} − o(m) for all µ and p. The Heaven–Hell example shows that self-optimizing policies do not always exist.
The OnlyOne example:
M := {µ_{y*}: y* ∈ Y, K(y*) = ⌊log₂ |Y|⌋}, µ_{y*}(yx_{<k} y_k 1̲) := δ_{y_k y*}
There are N = |Y| such y*. The number of errors is E ≥ N − 1 = |Y| − 1 = 2^{K(y*)} − 1 ≈ 2^{K(µ)}.
2^{K(µ)} is the best possible bound depending on K(µ); it could be acceptable if K(µ|ẋ_{<k}) = O(1).
µ is passive if the environment is not influenced by the agent's output. µ ∈ M is pseudo-passive if the corresponding p_best = p^ξ is self-optimizing.
32 The µ-expected number of suboptimal choices:
D_{nµξ} := E[Σ_{k=1}^n (1 − δ_{ẏ_k^ξ, ẏ_k^µ})], where ẏ_k^µ = p^µ(ẏẋ_{<k})
µ can be asymptotically learned if D_{nµξ}/n → 0, i.e. D_{nµξ} = o(n). Claim: AIXI can asymptotically learn any relevant problem.
µ is uniform if the ratios µ(yx_{<k} y_k x̲_k) / ξ(yx_{<k} y_k x̲_k) agree within a constant factor c across the possible x_k. There are relevant µ that are not uniform. Uniform µ can be asymptotically learned for appropriately weighted D_{nµξ} and bounded horizon.
µ is forgetful if µ(yx_{<k} y x̲_k) becomes independent of yx_{<l} for fixed l and k → ∞. µ is farsighted if lim_{m_k→∞} ẏ_k^{(m_k)} exists. Further concepts: Markovian, generalized (l-th order) Markovian, ergodic, factorizable.
33 Value-Related Optimality, Discounted Future Value Function
The γ-discounted weighted-average future value of a probabilistic policy π in environment ρ, given history yx_{<k} (the ρ-value of π given yx_{<k}):
V_{kγ}^{πρ}(yx_{<k}) := lim_{m→∞} (1/Γ_k) Σ_{yx_{k:m}} (γ_k r_k + ... + γ_m r_m) ρ(yx_{<k} y x̲_{k:m}) π(yx_{<k} ȳ̲x_{k:m})
with Γ_k := Σ_{i=k}^∞ γ_i. The discounted AIρ model is defined as the policy:
p^ρ := argmax_π V_{kγ}^{πρ}, V_{kγ}^ρ := V_{kγ}^{p^ρ ρ} = max_π V_{kγ}^{πρ}
Linearity and convexity of V in ρ:
V_{kγ}^{πξ} = Σ_{ν∈M} w_ν^k V_{kγ}^{πν} and V_{kγ}^ξ ≤ Σ_{ν∈M} w_ν^k V_{kγ}^ν
where ξ(yx_{<k} y x̲_{k:m}) = Σ_{ν∈M} w_ν^k ν(yx_{<k} y x̲_{k:m}) with w_ν^k := w_ν ν(y x̲_{<k}) / ξ(y x̲_{<k}).
Pareto optimality: there is no other policy π with V_{kγ}^{πν} ≥ V_{kγ}^{p^ξ ν} for all ν ∈ M and strict inequality for at least one ν.
34 Balanced Pareto optimality: with Δ_ν^k := V_{kγ}^ν − V_{kγ}^{p^ξ ν} ≥ 0 and Δ^k := Σ_{ν∈M} w_ν^k Δ_ν^k, we have 0 ≤ Δ_ν^k ≤ (w_ν^k)^{−1} Δ^k, where all quantities depend on the history yx_{<k}.
If there exists a sequence of self-optimizing policies π^k, then the universal policy p^ξ is self-optimizing:
[∃π^k ∀ν: V_{kγ}^{π^k ν} → V_{kγ}^ν w.ν.p.1] ⟹ V_{kγ}^{p^ξ µ} → V_{kγ}^µ w.µ.p.1
where the probabilities are conditional on the historic perceptions x_{<k}.
35 The values V_{kγ}^{πµ} and V_{kγ}^µ are continuous in µ, and V_{kγ}^{p^µ̂ µ} is continuous in µ̂ at µ̂ = µ: if
Σ_{x_k} |µ(yx_{<k} y x̲_k) − µ̂(yx_{<k} y x̲_k)| ≤ ε ∀yx_{<k} y_k, k ≥ k_0
then
i. |V_{kγ}^{πµ} − V_{kγ}^{πµ̂}| ≤ δ(ε)
ii. |V_{kγ}^µ − V_{kγ}^{µ̂}| ≤ δ(ε)
iii. |V_{kγ}^µ − V_{kγ}^{p^µ̂ µ}| ≤ 2 δ(ε)
for all k ≥ k_0 and yx_{<k}, where δ(ε) = r_max min_{n≥k} {(n − k) ε + Γ_n / Γ_k} → 0 for ε → 0.
36 If y_{1:m} = p(x_{<m}) (on-policy) and V_k = V_k(yx_{<k}^p), then the undiscounted universal future value V_{k m_k}^{pξ} with bounded dynamic horizon h_k = m_k − k + 1 converges i.m.s. to the true future value V_{k m_k}^{pµ}, and the discounted future value V_{kγ}^{pξ} converges i.m. to V_{kγ}^{pµ} for any summable discount sequence γ_k:
i. |V_{km}^{pξ} − V_{km}^{pµ}| ≤ (m − k + 1) r_max a_{k:m}; |V_{kγ}^{pξ} − V_{kγ}^{pµ}| ≤ r_max √(2 d_{k:∞})
ii. Σ_{k=1}^∞ E[(V_{k m_k}^{pξ} − V_{k m_k}^{pµ})²] ≤ 2 h_max³ r_max² D; E[(V_{kγ}^{pξ} − V_{kγ}^{pµ})²] ≤ 2 r_max² (D − D_{k−1}) → 0
iii. V_{k m_k}^{pξ} → V_{k m_k}^{pµ} i.m.s. if h_max < ∞; V_{kγ}^{pξ} → V_{kγ}^{pµ} i.m. for any γ
where a_{k:m} := Σ_{x_{k:m}} |µ(yx_{<k} y x̲_{k:m}) − ξ(yx_{<k} y x̲_{k:m})|, d_{k:m} := Σ_{x_{k:m}} µ(yx_{<k} y x̲_{k:m}) ln [µ(yx_{<k} y x̲_{k:m}) / ξ(yx_{<k} y x̲_{k:m})], and D_k := d_{1:k} ≤ ln w_µ^{−1} < ∞.
37 A Markov decision process is ergodic if there exists a policy which visits each state infinitely often with probability 1. There exist self-optimizing policies p_m for the class of ergodic MDPs:
∀ν ∈ M_MDP1: (1/m) V_{1m}^ν − (1/m) V_{1m}^{p_m ν} = O(m^{−1/3})
With effective horizon h_k^eff → ∞: ∃π^k ∀ν ∈ M_MDP1: (1/m) V_{1m}^{π^k ν} → (1/m) V_{1m}^ν for any history yx_{<k} if γ_{k+1}/γ_k → 1.
If M is a countable class of ergodic MDPs and ξ := Σ_{ν∈M} w_ν ν, then the AIξ policies p_m^ξ maximizing V_{1m}^{pξ} and p^ξ maximizing V_{kγ}^{πξ} are self-optimizing:
(1/m) V_{1m}^{p_m^ξ ν} → (1/m) V_{1m}^ν ∀ν ∈ M, and V_{kγ}^{p^ξ ν} → V_{kγ}^ν if γ_{k+1}/γ_k → 1
If M is finite, then the speed of the first convergence is at least O(m^{−1/3}). Ergodic POMDPs, ergodic l-th order MDPs and factorizable environments also allow self-optimizing policies.
38 Choice of the Horizon.
A fixed (effective) horizon is OK if we know the lifetime of the agent; e.g. if the probability of surviving to the next cycle is (always, independently) γ < 1, we can choose geometric discounting with rate γ.
General discounting introduces effective unbounded horizons. Let r_k → γ_k r_k with γ_k > 0 and r_k ∈ [0, 1]. If Γ_k := Σ_{i=k}^∞ γ_i < ∞, then
V_{kγ}^{pρ} := (1/Γ_k) lim_{m→∞} V_{km}^{pρ}
exists. The β-effective horizon is h_k^β := min {h ≥ 0: Γ_{k+h} ≤ β Γ_k}. Approximating V_{kγ} by the first h_k^β terms introduces an error of at most β r_max. Define h_k^eff := h_k^{β=1/2}.
Horizons (γ_k; Γ_k; h_k^β):
- finite: γ_k = 1 for k ≤ m, 0 for k > m; Γ_k = m − k + 1; h_k^β ≈ (1 − β)(m − k + 1)
- geometric: γ_k = γ^k, 0 ≤ γ < 1; Γ_k = γ^k / (1 − γ); h_k^β ≈ ln β / ln γ
- power: γ_k = k^{−1−ε}, ε > 0; Γ_k ≈ (1/ε) k^{−ε}; h_k^β ≈ (β^{−1/ε} − 1) k
- harmonic-like: γ_k = 1 / (k (ln k)^{1+ε}); Γ_k ≈ (1/ε) (ln k)^{−ε}; h_k^β ≈ k^{β^{−1/ε}}
- universal: γ_k = 2^{−K(k)}; Γ_k decreases slower than any computable function; h_k increases faster than any computable function
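The effective horizon can be computed numerically; for geometric discounting it matches the closed form ln β / ln γ up to rounding (the truncation length and γ = 0.95 below are just example choices):

```python
import math

def eff_horizon(gamma_of, beta=0.5, k=1, terms=100000):
    # h_k^beta = min {h >= 0 : Gamma_{k+h} <= beta * Gamma_k}
    Gamma = [0.0] * (terms + 2)
    for i in range(terms, 0, -1):          # tail sums Gamma_i = sum_{j>=i} gamma_j
        Gamma[i] = Gamma[i + 1] + gamma_of(i)
    h = 0
    while Gamma[k + h] > beta * Gamma[k]:
        h += 1
    return h

g = 0.95
h = eff_horizon(lambda i: g ** i)
print(h, math.ceil(math.log(0.5) / math.log(g)))  # 14 14
```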
39 Infinite horizon: take ẏ_k ∈ ∩_m Y_k(m), where Y_k(m) := {ẏ_k^{(m′)}: m′ ≥ m}; the limit lim_{m→∞} V_{km}(yx_{<k}) need not exist. But immortal agents construed this way are lazy: if postponing reward makes it bigger, the agent will never collect any reward.
For ρ ∈ M let ξ_α := (1 − α) ξ + α ρ; then
sup_p lim_k [V_{kγ}^µ − V_{kγ}^{p^{ξ_α} µ}] ≤ α r_max / ((1 − α) w_µ)
and there are examples where equality holds. Thus a belief contamination of magnitude α comparable to w_µ can completely degenerate performance. (???)
(Posterization) It is not true that if w_ν = 2^{−K(ν)}, then w_ν^k ≈ 2^{−K(ν|yx_{<k})}, where Σ_{ν∈M} w_ν^k ν(· | yx_{<k}) := ξ(· | yx_{<k}).
40 Actions as random variables
Instead of defining ξ as a mixture of environments, we could use a universal distribution over both perceptions and actions and then conditionalize on the actions:
ξ_alt^AI(y x̲_{1:n}) := M(yx_{1:n}) / Σ_{x_{1:n}} M(yx_{1:n})
where M is Solomonoff's prior (we could use ξ_U as well). Open problems:
- Is ξ_alt^AI enumerable?
- Is ξ_alt^AI = ξ^AI?
- Could M(yx_{<k} ȳ_k) be close to the action of p^ξ and/or p^{ξ_alt} for large k, justifying the interpretation that M(yx_{<k} ȳ_k) is the agent's own belief in selecting action y_k?
41 Uniform mixture of MDPs
Let µ_T ∈ M_MDP be a completely observable MDP with transition matrix T:
µ_T(a̲_1 s̲_1 ... a̲_n s̲_n) = T_{s_0 s_1}^{a_1} ... T_{s_{n−1} s_n}^{a_n}
Reward is a function of state: r_k = r(s_k). The Bayes mixture:
ξ(a s̲_{1:n}) := ∫ w_T µ_T(a s̲_{1:n}) dT
For a uniform prior belief:
ξ(as_{<n} a s̲_n) = ξ(as_{1:n}) / ξ(as_{<n}) = (N_{s_{n−1} s_n}^{a_n} + 1) / (Σ_{s′} N_{s_{n−1} s′}^{a_n} + S)
where S is the number of states and N_{ss′}^a is the historical (i.e. in as_{1:n}) number of transitions from s to s′ under a.
Although T is continuous and contains non-ergodic environments, the Bayes-optimal policy p^ξ is self-optimizing for ergodic environments µ_T ∈ M_MDP1. (Intuition: T is compact and non-ergodic environments have measure zero.)
42 The posterior belief w_T(as_{1:n}) ∝ w_T µ_T(as_{1:n}) is a (complex) distribution over the possible T. Most RL algorithms estimate only a single T (e.g. a most likely or an expected T). The policy p^ξ appropriately explores the environment, while popular policies based on E[T] or on maximum likelihood lack exploration. Expected transition probability:
E[T_{ss′}^a | as_{1:n}] = ∫ T_{ss′}^a w_T(as_{1:n}) dT = (N_{ss′}^a + 1) / (Σ_{s″} N_{ss″}^a + S)
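The expected-transition formula is a Laplace-rule estimate over transition counts; a minimal sketch (the 3-state MDP and the history of (action, state, next-state) triples are invented for the example):

```python
from collections import defaultdict

S = 3                      # number of states
counts = defaultdict(int)  # counts[(a, s, s')] = N^a_{s s'}

history = [(0, 0, 1), (0, 0, 1), (1, 1, 2), (0, 0, 2)]  # (a, s, s') triples
for a, s, s2 in history:
    counts[(a, s, s2)] += 1

def expected_T(a, s, s2):
    # E[T^a_{s s'} | history] = (N^a_{s s'} + 1) / (sum_{s''} N^a_{s s''} + S)
    n = counts[(a, s, s2)]
    total = sum(counts[(a, s, x)] for x in range(S))
    return (n + 1) / (total + S)

print(expected_T(0, 0, 1))  # (2 + 1) / (3 + 3) = 0.5
```

Note that the estimate is a proper distribution: the values over s′ sum to 1 even for unseen (a, s) pairs.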
43 Important Environmental Classes
Here ξ = ξ_U = M is Solomonoff's prior, i.e. AIξ = AIXI.
Sequence Prediction
The AIµ model and SPµ = Θ_µ (derived for binary alphabet B) are equivalent for known µ, and the expected number of prediction errors relates to the value function:
V_{1m}^µ = m − E_m^{Θµ}
The general ξ is not symmetric in y_i r_i ↔ (1 − y_i)(1 − r_i) and is thus more difficult. We concentrate on a deterministic computable environment ż = ż_1 ż_2 ... with Km(ż_1 ... ż_n) ≤ Km(ż) < ∞ and horizon m_k = k (greedily maximize the next reward; this is sufficient for SP but does not show the behavior of AIXI for a universal horizon). We have (best proven bound):
E_∞^{AIξ} ≤ 2^{Km(ż)+O(1)}
The intuitive interpretation is that each wrong prediction eliminates at least one program p of size l(p) ≤⁺ Km(ż). The best possible bound is:
E_∞^{SPξ} ≤⁺ 2 ln 2 · Km(ż)
44 Strategic Games
We restrict ourselves to deterministic strictly competitive strategic games. Assume a bounded-length game padded to length n, with r_1 = ... = r_{n−1} = 0 and r_n = 1 if the AIµ agent wins, 1/2 for a draw, and 0 if the environment wins; an illegal move is an instant loss. Assume the environment uses the minimax strategy:
ȯ_k = argmin_{o_k} max_{y_{k+1}} min_{o_{k+1}} ... max_{y_n} min_{o_n} V(ẏ_1 ȯ_1 ... ẏ_k o_k ... y_n o_n)
Then:
ẏ_k^{AI} = argmax_{y_k} Σ_{o_k} ... argmax_{y_n} Σ_{o_n} V(ẏȯ_{<k} yo_{k:n}) µ^{SG}(ẏȯ_{<k} y ō̲_{k:n})
= argmax_{y_k} min_{o_k} ... max_{y_n} min_{o_n} V(ẏȯ_{<k} yo_{k:n}) = ẏ_k^{SG}
If the game is played multiple times, then µ is factorizable. But if the game has variable length, then µ is no longer factorizable: a better player prefers short games and a quick draw over too long a win.
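The minimax recursion can be sketched on a hypothetical toy game (invented for illustration: players alternately add 1 or 2 to a counter, and whoever moves to exactly 10 wins; 1 = maximizer wins, 0 = loses, 0.5 = draw/undecided):

```python
def minimax(state, depth, maximizing):
    if state == 10:
        return 0 if maximizing else 1  # the previous mover reached 10 and won
    if state > 10 or depth == 0:
        return 0.5                     # overshoot/undecided treated as a draw
    vals = [minimax(state + m, depth - 1, not maximizing) for m in (1, 2)]
    return max(vals) if maximizing else min(vals)

print(minimax(0, 12, True))  # 1: the first mover wins with optimal play
```

The losing positions for the player to move are 1, 4 and 7; from 0 the maximizer moves to 1 and then mirrors the opponent to keep the counter on 4, 7, 10.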
45 Using the AIξ model for game playing. The AIξ agent receives less than log₂ 3 bits of information from a single game, so it needs at least on the order of O(K(game)) games. Variable-length games are better: the AIξ agent will quickly learn the legal moves from short initial games. Next, AIξ will learn the losing positions. Then, it will win some games by luck or will exploit the game symmetry to learn winning positions.
The AIξ agent can take advantage of environmental players with limited rationality by not settling on the minimax strategy.
46 Function Minimization
We consider distributions over functions:
µ_FM(y_1 z̲_1 ... y_n z̲_n) := Σ_{f: f(y_i) = z_i, 1 ≤ i ≤ n} µ(f)
A greedy model is not appropriate, because it will stick with an argument whose value is already below the expectations for the other values. For episodes of length m:
ẏ_k = argmin_{y_k} Σ_{z_k} ... min_{y_m} Σ_{z_m} (α_1 z_1 + ... + α_m z_m) µ(ẏ_1 ż_1 ... ẏ_{k−1} ż_{k−1} y_k z̲_k ... y_m z̲_m)
If we want the last output to be optimal, set α_1 = ... = α_{m−1} = 0, α_m = 1 (FMFξ); if we want already good approximations along the way, set α_1 = ... = α_m = 1 (FMSξ); etc. For the FMξ model:
ξ_FM(y_1 z̲_1 ... y_n z̲_n) := Σ_{q: q(y_i) = z_i, 1 ≤ i ≤ n} 2^{−l(q)}
FMξ will never cease searching for minima and will test an infinite set of y's for m → ∞. FMFξ will never repeat any y except at t = m. FMSξ will test a new y_t (for fixed m) only if the expected f(y_t) is not too large.
47 AIµ/ξ, we need r For k = α k z k, o k = z k. has problem with the FMF model: it must rst learn that it has to minimize AIξ a function. It can learn it by repeated minimization of (dierent) functions. 47
48 Supervised Learning from Examples (EX)
The environment presents inputs o_{k−1} = z_k v_k ∈ Z × (Y ∪ {?}): an instance z_k together with its correct classification v_k, or with ? when no classification is given. The relation R ⊆ Z × Y may itself be distributed with probability σ(R):
µ^AI(y_1 x̲_1 ... y_n x̲_n) = Σ_{R: ∀ 1 < i ≤ n: r_i = [(z_i, y_i) ∈ R]} σ(R) µ_R(o_1 ... o_n)
where x_i = r_i o_i and o_{i−1} = z_i v_i with v_i ∈ Y ∪ {?}. The AIξ agent only needs O(1) bits from the reinforcement r_k to learn to extract z_i from o_{i−1} and return y_i with (z_i, y_i) ∈ R.
49 AIXItl and Optimal Search
The Fastest and Shortest Algorithm for All Problems
Let p*: a given algorithm or a specification of a function; p: any program provably equivalent to p* with time complexity provably in t_p; time_p(x): the time needed to compute p(x).
For fixed ε ∈ (0, 1/2), the algorithm M_{p*}^ε computes p*(x) in time
time_{M_{p*}^ε}(x) ≤ (1 + ε) t_p(x) + d_p · time_{t_p}(x) + c_p
with the constants c_p and d_p independent of x. Always time_{t_p}(x) ≤ t_p(x) is not guaranteed, and the constants can be huge if no good provable time bound exists; this could be the case for many complex approximation problems and for universal reinforcement learning.
For example, if a matrix multiplication algorithm p with t_p(x) = d n² log n exists, then time_{M_{p*}^ε}(x) ≤ (1 + ε) d n² log n + n^{2+o(1)} for all x.
Blum's speed-up theorem doesn't affect this, because its speed-up sequence is not computable.
50 Levin Search
Inverts a quickly computable function g, devoting a 2^{−l(p)} time fraction to each potential inverse algorithm p. Computation time: 2^{l(p)} · (time_p(x) + time for checking g(p(x)) = x).
It can be implemented as Li & Vitányi's SEARCH(g): run every p shorter than i for 2^i 2^{−l(p)} steps in phase i = 1, 2, 3, ..., until g is inverted on x.
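The phase scheme can be sketched over a toy program space; `run` below is a hypothetical interpreter (a bit-string "computes" its own binary value in l(p) steps), invented purely to make the time-allocation concrete:

```python
def run(p, steps):
    # toy interpreter: program p (a bit-string) "computes" int(p, 2),
    # taking len(p) steps; returns None if the step budget is too small
    if steps < len(p):
        return None
    return int(p, 2)

def g(y):  # quickly computable function to invert
    return y * y

def levin_search(x, max_phase=30):
    # phase i: run every program p with l(p) <= i for 2^(i - l(p)) steps
    for i in range(1, max_phase):
        for L in range(1, i + 1):
            for n in range(2 ** L):
                p = format(n, "0%db" % L)
                y = run(p, 2 ** (i - L))
                if y is not None and g(y) == x:
                    return p, i
    return None

p, phase = levin_search(49)  # find p with g(run(p)) = 49, i.e. run(p) = 7
print(p, phase)              # 111 5
```

The shortest solution "111" (l(p) = 3) is found in the first phase that grants it enough steps (2^{5−3} = 4 ≥ 3).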
51 The Fast Algorithm M_{p*}^ε
M_{p*}^ε(x): Initialize the shared variables L := {}, t_fast := ∞, p_fast := p*. Start algorithms A, B, C in parallel with relative computational resources ε, ε, 1 − 2ε respectively.
Algorithm A. Systematically search the space of proofs for proofs of formulas of the form
∀y: u(p′, y) = u(p*, y) ∧ u(t, y) ≥ time_{p′}(y)
For each such proof, add the answer substitution (p′, t) to L.
Algorithm B. Run U on (t, x) in parallel for all (p, t) ∈ L with relative computational resources 2^{−l(p)−l(t)}. If U halts for some t and U(t, x) < t_fast, then set t_fast := U(t, x) and p_fast := p, and restart algorithm C.
Algorithm C. Run U on (p_fast, x); for each executed step decrease t_fast by 1. If U halts, abort the computations of A and B and return U(p_fast, x).
52 Time-Bounded AIXI Model
Time-Limited Probability Distributions. We could limit the environments in the universal mixture ξ to those of length ≤ l̃ computable in time ≤ t̃, arriving at an expectimax algorithm AIξt̃l̃ with complexity t(ẏ_k) = O(|Y|^{h_k} |X|^{h_k} 2^{l̃} t̃). (Considered poor/unintelligent.)
The Best Vote Algorithm. Without normalization, the ξ-expected future reward is enumerable:
V_{km}^{pξ}(ẏẋ_{<k}) := Σ_{q∈Q_k} 2^{−l(q)} V_{km}^{pq}, V_{km}^{pq} := r(x_k^{pq}) + ... + r(x_m^{pq})
At every cycle we select the best (possibly inconsistent with the history) policy. But ẏ_k^{AIξ} is uncomputable and V_{k m_k} is only approximable (with the same effort as computing ẏ_k): so let each policy itself estimate its value (by w_k^p): p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p.
No policy is allowed to claim to be better than it is (valid approximation):
VA(p) ≡ [∀k ∀w_1^p y_1^p ẏ_1 ẋ_1 ... w_k^p y_k^p: p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p ⟹ w_k^p ≤ V_{k m_k}^{pξ}(ẏẋ_{<k})]
V_{k m_k}^ξ is enumerable: ∃(p^i): VA(p^i) and lim_i w_k^{p^i} = V_{k m_k}^ξ. The convergence is not uniform in k, but that is OK: we select a policy anew in each step.
53 The Universal Time-Bounded AIXItl Agent.
1. Systematically search the space of proofs shorter than l_P for proofs of VA(p) and collect all answer substitutions.
2. Eliminate all p of length > l̃.
3. Modify each p to stop within t̃ time steps (aborting with w_k = 0 if needed).
4. Start the first cycle: k := 1.
5. Run every p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p, where all outputs are redirected to some auxiliary tape, incrementally: add ẏẋ_{k−1} to the input tape and continue the computation of the previous cycle.
6. Select p_k := argmax_p w_k^p.
7. Write ẏ_k := y_k^{p_k} to the output tape.
8. Receive input ẋ_k from the environment.
9. Begin the next cycle: k := k + 1, goto step 5.
[Compare with e.g. accuracy-based learning classifier systems XCS by Wilson, or market/economy RL by Baum & Durdanovic.]
54 Optimality of AIXItl. Let p be any extended chronological (incremental) program (like above) of length l(p) ≤ l̃ and computation time per cycle t(p) ≤ t̃, for which there exists a proof of VA(p) of length ≤ l_P. We call p′ effectively more or equally intelligent than p (p′ ⪰^c p) if
∀k ∀ẏẋ_{<k} ∀w_{1:k} ∃w′_{1:k}: p(ẏẋ_{<k}) = w_1 ... w_k ∧ p′(ẏẋ_{<k}) = w′_1 ... w′_k ∧ w′_k ≥ w_k
Then p* ⪰^c p for all such p. The length of p* is l(p*) = O(log(l̃ · t̃ · l_P)), the setup time is (at most, depending on the proof search technique) t_setup(p*) = O(l_P² 2^{l_P}), and the computation time per cycle is t_cycle(p*) = O(2^{l̃} t̃). (To go faster, we could eliminate provably poor policies: how many Pareto-good policies are there?)
There could be policies which produce good outputs within reasonable time, but whose justification w^p or proof of VA(p) takes unreasonably long.
p* must be able to continue strategies started by other, even inconsistent, policies. A policy can steer the environment in a direction for which it is specialized; this requires enough separability to recover.
55 Since AIXI is incomputable but assumes computable environments, it cannot gamble with other AIXIs. Are there interesting environmental classes for which AIξ ∉ M or AIξtl ∉ M?
Algorithm / Properties | time efficient | data efficient | exploration | convergence | global optimum | generalization | POMDP | learning | active
Value/Policy iteration (finite S) | yes/no | yes | — | YES | YES | NO | NO | NO | yes
TD with finite S | yes/no | NO | NO | YES | YES | NO | NO | YES | YES
TD w. linear func.approx. | yes/no | NO | NO | yes | yes/no | YES | NO | YES | YES
TD w. general func.approx. | no/yes | NO | NO | no/yes | NO | YES | NO | YES | YES
Direct Policy Search | no/yes | YES | NO | no/yes | NO | YES | no | YES | YES
Planners | yes/no | YES | yes | YES | YES | no | no | YES | yes
RL with Split Trees | yes | YES | no | YES | NO | yes | YES | YES | YES
Pred. w. Expert Advice | yes/no | YES | — | YES | yes/no | yes | NO | YES | NO
Adaptive Levin Search | no/yes | no | no | yes | yes/no | yes | YES | YES | YES
OOPS | yes/no | no | — | yes | yes/no | YES | YES | YES | YES
Market/Economy RL | yes/no | no | NO | no | yes/no | yes | yes/no | YES | YES
SPXI | no | YES | — | YES | YES | YES | NO | YES | NO
AIXI | NO | YES | YES | yes | YES | YES | YES | YES | YES
AIXItl | no/yes | YES | YES | YES | yes | YES | YES | YES | YES
Human | yes | yes | yes | no/yes | NO | YES | YES | YES | YES
AIXI can incorporate prior knowledge D: just prepend it in any encoding; this decreases K(µ) to K(µ|D).
56 Speed Prior and OOPS (reference: TheNewAI)

Speed Prior

Assumption. The environment is deterministic.

Postulate. The cumulative prior probability measure of all x incomputable within time t by any method is at most inversely proportional to t.

Algorithm. Set t := 1. Start a universal TM with empty input tape. Repeat: While the number of instructions executed so far exceeds t: toss an unbiased coin; heads up: set t := 2t, otherwise exit. If the input cell contains a symbol, execute it; otherwise set the cell's symbol randomly and set t := t/2.
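The sampling procedure above can be sketched as follows. The "machine" here is only a placeholder that copies input bits to the output; everything except the budget-doubling coin and the t/2 charge per random bit is an assumption for illustration, not part of the slides:

```python
import random

def guess(max_iters=10_000):
    """Speed-prior-style sampler: double the time budget t on heads
    (give up on tails), and halve t each time a fresh random input
    bit has to be guessed."""
    t = 1.0
    tape = []      # input bits guessed so far
    executed = 0   # instructions executed so far
    pos = 0        # next input cell to read
    out = []
    while executed < max_iters:
        # whenever the instruction count exceeds t, toss an unbiased coin
        while executed > t:
            if random.random() < 0.5:
                t *= 2       # heads: double the budget and continue
            else:
                return out   # tails: exit with the output produced so far
        if pos == len(tape):             # empty cell: fill it randomly...
            tape.append(random.randint(0, 1))
            t /= 2                       # ...and pay by halving t
        out.append(tape[pos])            # toy machine: "executing" = copying
        pos += 1
        executed += 1
    return out
```

Outputs that need many random bits or many steps are thus sampled with probability falling off roughly inversely in their computation time, which is the content of the postulate.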
57 Optimal Ordered Problem Solver

A searcher A is n-bias-optimal (n ≥ 1) if for any maximal total search time Tmax > 0, A is guaranteed to solve any problem r ∈ R if the problem has a solution p ∈ C that can be created and tested in time t(p, r) ≤ P(p|r)·Tmax/n (P is the task-specific bias).

Basic ingredients of OOPS:

Primitives. Interruptible low- or high-level instructions-tokens (e.g. theorem provers, matrix operators for neural nets, etc.).

Task-specific prefix codes. Token sequences / program prefixes in a domain-specific language. Instructions can transfer control to previously selected tokens (loops / calls). A prefix is elongated on the program's explicit request. A prefix may be a complete program for some task (programs are prefix-free w.r.t. a task), but may request more tokens on another task (incrementally growing self-delimiting programs).

Access to previous solutions. Let p^n denote a found prefix solving the first n tasks. p^1, ..., p^n are stored or frozen in non-modifiable memory shared by all tasks (accessible to p^{n+1}), but can be copied into modifiable task-specific memory.

Initial bias. Task-dependent, user-provided probability distribution on program prefixes.
58 Self-computed suffix probabilities. Any executed prefix can assign a probability distribution to its continuations. The distribution is encoded and manipulated in task-specific internal memory.

Two searches. Run in parallel until p^{n+1} is discovered. The first is exhaustive: it tests all possible prefixes in parallel on all tasks up to n+1. The second is focused: it searches for prefixes starting with p^n and tests only on task n+1 (such prefixes already solve tasks up to n). When an optimal solver is found as some p^{n_0}, at most half of the future run time is wasted by the first search.

Bias-optimal backtracking. Depth-first search in program space, with backtracking triggered by running over time (prefix probability multiplied by total search time so far). Space is reused.

Example / experiments. An interpreter for a FORTH-like language with recursive functions, loops, arithmetic, bias-shifting instructions, and domain-specific instructions. First taught about recursion: samples of the CF language {1^k 2^k}, k ≤ 30. This took 1/3 of a day. (OOPS found a universal solver for all k.) Then, by rewriting its search procedure, it learned k-disk Towers of Hanoi within a couple of days.
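The "prefix probability multiplied by total search time" rule is Levin-search-style time allocation: in successive phases, each candidate program is run for a number of steps proportional to its prior probability. A minimal sketch, with toy bit-string programs and names of my own invention rather than actual OOPS machinery:

```python
from itertools import product

def candidate_programs(max_len):
    """All binary programs up to max_len bits, shortest first."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def levin_search(is_solution, run, prior, max_phase=20):
    """Bias-optimal search sketch: in phase i, give every candidate p
    a budget of prior(p) * 2**i steps, so the total time spent on p
    stays within a constant factor of its fair share of the bias."""
    for i in range(max_phase):
        budget = 2 ** i
        for p in candidate_programs(max_len=i):
            steps = int(prior(p) * budget)
            if steps == 0:
                continue                 # p's share of this phase is too small
            result = run(p, steps)       # run p for at most `steps` steps
            if result is not None and is_solution(result):
                return p
    return None
```

With the universal bias prior(p) = 2^-l(p), a solution of length l and runtime t is found after total time O(2^l · t), which is the 1-task analogue of the n-bias-optimality bound above.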
59 OOPS-Based Reinforcement Learning. Two OOPS modules:

1. The first module (the predictor) is trained to find a better world model.
2. The second module (the control program) will then use the model to search for a future action sequence with better cumulative reward.
3. After the current cycle's time for the control program is finished, we execute the current action of the best control program found in step 2.

OOPS is 8-bias-optimal.
60 Goedel Machine (reference: GoedelMachines)

While executing some initial problem solving strategy, the Goedel Machine simultaneously runs a proof searcher which systematically and repeatedly tests proof techniques. An unguarded part of the GM, the switchprog, can rewrite the whole GM. It is executed only when the GM has found a proof that doing so will result in bigger expected reward.
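The control flow just described can be caricatured in a few lines. This is a toy sketch under heavy assumptions: the proof searcher is a black-box callable, and "rewriting the whole GM" is reduced to replacing the current policy; none of the axiom system or theorem proving is modeled:

```python
def goedel_machine(policy, proof_searcher, act, cycles):
    """Toy Goedel Machine loop: act with the current policy while a proof
    searcher runs 'in parallel' (here: one call per cycle); when it returns
    a proved switchprog, the machine rewrites itself by executing it."""
    reward = 0.0
    for k in range(cycles):
        reward += act(policy, k)             # interact with the environment
        switchprog = proof_searcher(policy, k)
        if switchprog is not None:           # proof found: switching provably
            policy = switchprog(policy)      # increases expected reward
    return policy, reward
```

The essential point preserved by the sketch is that the rewrite happens only when the searcher certifies it, and that it may replace everything, including the searcher's own time share.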
61 A program that prints itself. There is no problem with a part of a program representing the whole program, to any degree of accuracy.

main(){char q=34,n=10,*a="main(){char q=34,n=10,*a=%c%s%c;printf(a,q,a,q,n);}%c";printf(a,q,a,q,n);}

Globally Optimal Self-Changes. Given any formalizable utility function u and assuming consistency of the underlying formal system A, any self-change of p obtained through execution of some switchprog identified through the proof of a target theorem [that running switchprog increases expected reward] is globally optimal: the expected utility of executing the present switchprog is higher than the utility of waiting for the proof searcher to produce an alternative switchprog later.
62 Proof Techniques

1. get-axiom(n)
   a. Hardware axioms
   b. Reward axioms
   c. Environment axioms
   d. Uncertainty axioms and string manipulation axioms
   e. Initial state axioms
   f. Utility axioms
2. apply-rule(k, m, n)
3. delete-theorem(m)
4. set-switchprog(m, n)
5. state2theorem(m, n)

The GM hardware can itself be probabilistic; this has to be represented by a probabilistic logic and in expectations about which theorems are true.
63 Possible Types of Goedel Machine Self-improvements

1. Just change the ratio of time-sharing between the proof searching subroutine and the subpolicy e (those parts of p responsible for environment interaction).
2. Modify e only. For example, to conduct some experiments and use the resulting knowledge. (Even if it turns out that it would have been better to stick with the previous routine, the expectation of reward can favor experimentation.)
3. Modify the axioms to speed up theorem proving.
4. Modify the utility function and target theorem, so that the new values are better according to the current target theorem.
5. Modify the probability distribution on proof techniques, etc.
6. Do promptly a very limited rewrite to meet some deadline.
7. In certain uninteresting environments, trash almost all of the GM and leave a looping call to a pleasure-center-activating function.
8. Take actions in the external environment to augment the machine's hardware.
Computability and Complexity Decidability, Undecidability and Reducibility; Codes, Algorithms and Languages CAS 705 Ryszard Janicki Department of Computing and Software McMaster University Hamilton, Ontario,
More information