1 Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Adapted by Łukasz Stafiniak from the book by Marcus Hutter.

2 Table of contents
Universal Sequence Prediction
Epicurus, Ockham, Hume, Bayes, Solomonoff
Algorithmic Information Theory and Probability
Convergence, Error and Loss Bounds
Convergence bounds
Error bounds
Loss bounds
Games of Chance
Optimality
The Universal Algorithmic Agent AIXI
Agents in Known Probabilistic Environments
The AIµ model in Functional Form
The AIµ model in Recursive and Iterative Form
Factorizable environments µ
Probabilistic policies
Persistent Turing Machines (reference: PTMs)
The Universal Algorithmic Agent AIXI

3 Intelligence Order Relation
Separability concepts
Value-Related Optimality, Discounted Future Value Function
Choice of the Horizon
Actions as random variables
Uniform mixture of MDPs
Important Environmental Classes
Sequence Prediction
Strategic Games
Using the AIξ model for game playing
Function Minimization
Supervised Learning from Examples (EX)
AIXItl and Optimal Search
The Fastest and Shortest Algorithm for All Problems
Levin Search
The Fast Algorithm M_p^ε
Time-Bounded AIXI Model
Time-Limited Probability Distributions
Best Vote Algorithm

4 The Universal Time-Bounded AIXItl Agent
Optimality of AIXItl
A program that prints itself
Proof Techniques
Speed Prior and OOPS (reference: TheNewAI)
Speed Prior
Optimal Ordered Problem Solver
Goedel Machine (reference: GoedelMachines)
Possible Types of Goedel Machine Self-improvements

5 Universal Sequence Prediction: Epicurus, Ockham, Hume, Bayes, Solomonoff
Epicurus' principle of multiple explanations; Occam's razor (simplicity) principle; Hume's negation of induction; Bayes' rule for conditional probabilities; Solomonoff's universal theory of inductive inference.
Induction here = reasoning about the future from past experience. Prequential approach (transductive inference) = predictions without building a model. Every induction problem can be phrased as a sequence prediction task. Classification is a special case of sequence prediction. We are interested in maximizing profit / minimizing loss. Separating noise from data is not necessary.

6 Algorithmic Information Theory and Probability
A prefix code (prefix-free set of strings) satisfies the Kraft inequality: Σ_x 2^{-l(x)} ≤ 1.
Kolmogorov complexity: K(x) := min_p {l(p): U(p) = x},  K(x|y) := min_p {l(p): U(y,p) = x}.
Properties: K(x) ≤⁺ l(x) + 2 log₂ l(x);  K(x,y) =⁺ K(y|x,K(x)) + K(x);  K(x) ≤⁺ -log₂ P(x) + K(P) if P: B* → [0,1] is enumerable and Σ_x P(x) ≤ 1.
Bayes rule: p(H_i|D) = p(D|H_i) p(H_i) / Σ_{i∈I} p(D|H_i) p(H_i).
Kolmogorov complexity is only co-enumerable = upper semi-computable.
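The Kraft inequality and Bayes' rule are easy to make concrete. The following minimal sketch (my own illustration, not from the book; the code words and the Bernoulli hypothesis class are made up) checks the Kraft sum for a prefix-free code and computes a Bayesian posterior over a finite hypothesis class.

# Sketch: Kraft inequality for a prefix code and Bayes' rule over finite hypotheses.
# The code words and hypothesis class below are illustrative assumptions, not from the book.

def kraft_sum(codewords):
    """Sum_x 2^{-l(x)}; <= 1 for any prefix-free set of binary strings."""
    return sum(2.0 ** -len(w) for w in codewords)

def bayes_posterior(prior, likelihood, data):
    """p(H_i|D) = p(D|H_i) p(H_i) / sum_j p(D|H_j) p(H_j)."""
    joint = {h: likelihood(h, data) * prior[h] for h in prior}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in prior}

if __name__ == "__main__":
    print(kraft_sum(["0", "10", "110", "111"]))  # = 1.0, a complete prefix code

    # Two Bernoulli hypotheses for a coin; data = observed bit string.
    prior = {"fair": 0.5, "biased": 0.5}
    theta = {"fair": 0.5, "biased": 0.9}
    def likelihood(h, data):
        return theta[h] ** data.count("1") * (1 - theta[h]) ** data.count("0")
    print(bayes_posterior(prior, likelihood, "111101"))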

7 Universal prior: M(x) := Σ_{p: U(p)=x*} 2^{-l(p)}, the probability that the output of a universal monotone TM starts with x when provided with fair coin flips on the input tape.
µ ≥ 0 is a semimeasure if µ(ε) ≤ 1 and µ(x) ≥ µ(x0) + µ(x1) (a probability measure if equality holds).
Universality of M: M multiplicatively dominates all enumerable semimeasures: M(x) ≥ 2^{-K(ρ)} ρ(x), where ρ is an enumerable semimeasure. M is enumerable but not estimable.
Conditioning on a string: M(y|x) := M(xy)/M(x) ≥ 2^{-K(y|x)}.

8 Try to predict the continuation x_n ∈ B of a given sequence x_1 ... x_{n-1}.
Σ_{t=1}^∞ (1 - M(x_t|x_{<t})) ≤ -Σ_{t=1}^∞ ln M(x_t|x_{<t}) = -ln M(x_{1:∞}) ≤ ln2 · Km(x_{1:∞})
If x_{1:∞} is computable, then Km(x_{1:∞}) < ∞, and M(x_t|x_{<t}) → 1.
Assume now the true sequence is drawn from a computable probability distribution µ. The probability of x_n given x_{<n} is µ(x_n|x_{<n}) = µ(x_{1:n})/µ(x_{<n}). Posterior convergence of M to µ:
Σ_{t=1}^∞ Σ_{x_{<t}∈B^{t-1}} µ(x_{<t}) (M(0|x_{<t}) - µ(0|x_{<t}))² ≤ (ln2/2) K(µ) < ∞
That is, M(0|x_{<t}) - µ(0|x_{<t}) tends to zero with µ-probability 1. We will see a proof later, by approximating M with ξ(x) := Σ_{ν∈M} 2^{-K(ν)} ν(x).
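Since M itself is only enumerable, the convergence statement is easiest to see numerically with the finite mixture ξ(x) = Σ_ν 2^{-K(ν)} ν(x) mentioned above, with hand-chosen weights standing in for 2^{-K(ν)} (an assumption of this sketch). The posterior predictive ξ(1|x_{<t}) then tracks µ(1|x_{<t}) when the data are drawn from some µ in the class.

# Sketch: posterior convergence of a finite Bayes mixture xi to the true Bernoulli mu.
# The weights w_nu stand in for 2^{-K(nu)}; they are assumptions of this illustration.
import random

class_thetas = [0.1, 0.3, 0.5, 0.7, 0.9]          # Bernoulli environments nu(1) = theta
weights = {th: 1.0 / len(class_thetas) for th in class_thetas}
true_theta = 0.7
random.seed(0)

w = dict(weights)                                  # posterior weights w_nu(x_<t)
for t in range(1, 2001):
    xi_one = sum(w[th] * th for th in w)           # xi(1|x_<t) = sum_nu w_nu * nu(1)
    x_t = 1 if random.random() < true_theta else 0
    # Bayes update: w_nu(x_1:t) = w_nu(x_<t) * nu(x_t|x_<t) / xi(x_t|x_<t)
    xi_xt = xi_one if x_t == 1 else 1.0 - xi_one
    for th in w:
        nu_xt = th if x_t == 1 else 1.0 - th
        w[th] *= nu_xt / xi_xt
    if t % 500 == 0:
        print(t, round(xi_one, 3), "-> true", true_theta)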

9 A sequence is µ-Martin-Löf random (µ.M.L.) iff ∃c ∀n: M(x_{1:n}) ≤ c µ(x_{1:n}). It is µ/ξ-Martin-Löf random (µ.ξ.r.) iff ∃c ∀n: ξ(x_{1:n}) ≤ c µ(x_{1:n}).
A theorem true for all µ-M.L. random sequences is true with µ-probability 1.
Complexity increase:
K(yx) ≤ K(y) + K(x|y) + O(1)  (prefix Kolmogorov complexity)
C(yx) ≤ C(y) + K'(x|y) + O(K(C(y)))  (plain Kolmogorov complexity)
KM(yx) ≤ KM(y) + K'(x|y) + O(K(l(y))),  where KM(x) := -log₂ M(x)
KM(yx) ≤ KM(y) + K'(µ|y) - log₂ µ(x|y) + O(K(l(y)))
where K' ∈ {K, C}.

10 A predictor based on K fails, due to K(x1) =⁺ K(x0). The monotone complexity Km(x) := min_p {l(p): U(p) = x*} does not suffer from this.
m(x) := 2^{-Km(x)} is extremely close to M(x). m = 2^{-Km}:
converges on-sequence rapidly: Π_{t=1}^n m(x_t|x_{<t}) = 2^{-Km(x_{1:n})}, and m(x_t|x_{<t}) < 1 at most Km(x_{1:∞}) times;
may converge slowly off-sequence: there are U and x_{1:∞} with Km(x_{1:∞}) = s for which the off-sequence product Π_t m(x̄_t|x_{<t}) is still of order 2^{-s};
may not converge at all for probabilistic environments: there are computable measures µ ∈ M_comp^msr \ M_det and x_{1:∞} with m(x_t|x_{<t}) ↛ µ(x_t|x_{<t}).
m is not a semimeasure, but normalization does not improve the above.

11 Convergence, Error and Loss Bounds
Assumptions: ξ is a mixture distribution, a w_ν-weighted sum of probability distributions ν from a set M containing the true distribution µ:
ξ(x_t|x_{<t}) = Σ_{ν∈M} w_ν(x_{<t}) ν(x_t|x_{<t}),  w_ν(x_{1:t}) := w_ν(x_{<t}) ν(x_t|x_{<t}) / ξ(x_t|x_{<t}),  w_ν(ε) = w_ν.
Distance measures between y and z:
absolute (or Manhattan): a(y,z) := Σ_i |y_i - z_i|
quadratic (or squared Euclidean): s(y,z) := Σ_i (y_i - z_i)²
(squared) Hellinger distance: h(y,z) := Σ_i (√y_i - √z_i)²
relative entropy or KL divergence: d(y,z) := Σ_i y_i ln(y_i/z_i)
absolute divergence: b(y,z) := Σ_i y_i |ln(y_i/z_i)|
Entropy inequalities: s ≤ d, h ≤ d, d ≤ b, a² ≤ 2d.
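For concreteness, here is how the five (pseudo-)distances between two predictive distributions y = µ(·|x_{<t}) and z = ξ(·|x_{<t}) can be computed; the toy probability vectors are made up for illustration.

# Sketch: the distance/divergence measures a, s, h, d, b between two distributions.
from math import sqrt, log

def a(y, z): return sum(abs(yi - zi) for yi, zi in zip(y, z))                 # absolute / Manhattan
def s(y, z): return sum((yi - zi) ** 2 for yi, zi in zip(y, z))               # squared Euclidean
def h(y, z): return sum((sqrt(yi) - sqrt(zi)) ** 2 for yi, zi in zip(y, z))   # squared Hellinger
def d(y, z): return sum(yi * log(yi / zi) for yi, zi in zip(y, z) if yi > 0)  # KL / relative entropy
def b(y, z): return sum(yi * abs(log(yi / zi)) for yi, zi in zip(y, z) if yi > 0)  # absolute divergence

y = [0.7, 0.2, 0.1]   # "true" mu(.|x_<t), illustrative
z = [0.5, 0.3, 0.2]   # mixture xi(.|x_<t), illustrative
print({f.__name__: round(f(y, z), 4) for f in (a, s, h, d, b)})
# Sanity check of two of the entropy inequalities used in the text: a^2 <= 2d and h <= d.
assert a(y, z) ** 2 <= 2 * d(y, z) + 1e-12 and h(y, z) <= d(y, z) + 1e-12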

12 Instantaneous (at time t) and total distances between ξ and µ:
X = {1,...,N}, N = |X|, i = x_t, y_i = µ(x_t|x_{<t}), z_i = ξ(x_t|x_{<t}).
a_t(x_{<t}) := Σ_{x_t} |µ(x_t|x_{<t}) - ξ(x_t|x_{<t})|,  A_n := Σ_{t=1}^n E[a_t(x_{<t})]
s_t(x_{<t}) := Σ_{x_t} (µ(x_t|x_{<t}) - ξ(x_t|x_{<t}))²,  S_n := Σ_{t=1}^n E[s_t(x_{<t})]
h_t(x_{<t}) := Σ_{x_t} (√µ(x_t|x_{<t}) - √ξ(x_t|x_{<t}))²,  H_n := Σ_{t=1}^n E[h_t(x_{<t})]
d_t(x_{<t}) := Σ_{x_t} µ(x_t|x_{<t}) ln(µ(x_t|x_{<t})/ξ(x_t|x_{<t})),  D_n := Σ_{t=1}^n E[d_t(x_{<t})]
b_t(x_{<t}) := Σ_{x_t} µ(x_t|x_{<t}) |ln(µ(x_t|x_{<t})/ξ(x_t|x_{<t}))|,  B_n := Σ_{t=1}^n E[b_t(x_{<t})]
For example, the first convergence result on the next page says:
Σ_{t=1}^n E[Σ_{x_t} (µ(x_t|x_{<t}) - ξ(x_t|x_{<t}))²] ≤ ln w_µ^{-1}

13 Convergence of ξ to µ:
S_n ≤ D_n ≤ ln w_µ^{-1} < ∞  ⟹  s_t(x_{<t}) →_t 0 and d_t(x_{<t}) →_t 0 w.µ.p.1  ⟹  ξ(x_t|x_{<t}) - µ(x_t|x_{<t}) →_t 0 w.µ.p.1 (and i.m.s.) for any x_t
E[Σ_{t=1}^n (√(ξ(x_t|x_{<t})/µ(x_t|x_{<t})) - 1)²] ≤ H_n ≤ D_n ≤ ln w_µ^{-1} < ∞  ⟹  ξ(x_t|x_{<t})/µ(x_t|x_{<t}) →_t 1 w.µ.p.1 and i.m.s.
b_t(x_{<t}) ≥ d_t(x_{<t}),  a_t(x_{<t}) ≤ √(2 d_t(x_{<t})),  B_n ≥ D_n,  A_n ≤ √(2n D_n),
where w_µ is the weight of µ in ξ and x_{1:∞} is an arbitrary (nonrandom) sequence.
µ/ξ-randomness cannot be decided from ξ being a mixture distribution and the dominance property alone. (E.g., for Bernoulli sequences, it is related to denseness of M_Θ.)

14 When µ ∉ M, but there is µ̂ ∈ M with KL divergence D_n(µ‖µ̂) := Σ_{x_{1:n}} µ(x_{1:n}) ln(µ(x_{1:n})/µ̂(x_{1:n})) ≤ c, then
D_n = E[ln(µ(x_{1:n})/ξ(x_{1:n}))] = E[ln(µ̂(x_{1:n})/ξ(x_{1:n}))] + E[ln(µ(x_{1:n})/µ̂(x_{1:n}))] ≤ ln w_µ̂^{-1} + c
Error bounds. A prediction scheme Θ_ρ predicts x_t^{Θ_ρ} := argmax_{x_t} ρ(x_t|x_{<t}). Probability of making a wrong prediction and expected number of errors:
e_t^{Θ_ρ}(x_{<t}) := 1 - µ(x_t^{Θ_ρ}|x_{<t}),  E_n^{Θ_ρ} := Σ_{t=1}^n E[e_t^{Θ_ρ}(x_{<t})]
Error bound:
0 ≤ E_n^{Θ_ξ} - E_n^{Θ_µ} ≤ √(2(E_n^{Θ_ξ} + E_n^{Θ_µ}) S_n),
which implies E_n^{Θ_ξ} - E_n^{Θ_µ} ≤ S_n + √(4 E_n^{Θ_µ} S_n + S_n²) ≤ 2 S_n + 2√(E_n^{Θ_µ} S_n).
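A minimal sketch of the error-bound setup: the predictor Θ_ρ outputs argmax_x ρ(x|x_{<t}), and one can count the µ-expected errors of the informed predictor Θ_µ against the mixture-based Θ_ξ. Bernoulli environments and uniform prior weights are assumptions of this illustration.

# Sketch: Theta_rho predictors and their expected error counts E_n on Bernoulli data.
import random
random.seed(1)

thetas = [0.2, 0.4, 0.6, 0.8]                   # finite class M (assumption)
true_theta = 0.8
w = {th: 0.25 for th in thetas}                 # prior weights w_nu (assumption)

def predict(prob_one):                          # Theta_rho: argmax_{x_t} rho(x_t|x_<t)
    return 1 if prob_one >= 0.5 else 0

E_mu, E_xi = 0.0, 0.0
for t in range(1, 1001):
    xi_one = sum(w[th] * th for th in thetas)   # xi(1|x_<t)
    # instantaneous error probabilities e_t = 1 - mu(x_t^Theta | x_<t)
    E_mu += 1 - (true_theta if predict(true_theta) == 1 else 1 - true_theta)
    E_xi += 1 - (true_theta if predict(xi_one) == 1 else 1 - true_theta)
    x_t = 1 if random.random() < true_theta else 0
    xi_xt = xi_one if x_t else 1 - xi_one
    for th in thetas:                           # posterior update of the weights
        w[th] *= (th if x_t else 1 - th) / xi_xt
print("E_n^mu =", round(E_mu, 1), " E_n^xi =", round(E_xi, 1), " regret =", round(E_xi - E_mu, 2))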

15 Loss bounds. Let l_{x_t y_t} ∈ R be the received loss when taking action y_t ∈ Y and x_t ∈ X is the t-th symbol of the sequence. W.l.o.g., 0 ≤ l_{x_t y_t} ≤ 1. We call an action a prediction even if X ≠ Y.
A prediction scheme Λ_ρ predicts y_t^{Λ_ρ} := argmin_{y_t ∈ Y} Σ_{x_t} ρ(x_t|x_{<t}) l_{x_t y_t}.
The actual and total µ-expected loss:
l_t^{Λ_ρ}(x_{<t}) := E_t[l_{x_t y_t^{Λ_ρ}}],  L_n^{Λ_ρ} := Σ_{t=1}^n E[l_t^{Λ_ρ}(x_{<t})]
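The loss-based scheme Λ_ρ simply picks the action with the smallest ρ-expected loss. A sketch with a hypothetical 2x2 loss matrix (the matrix and the probabilities are assumptions of the illustration):

# Sketch: the loss-minimizing prediction scheme Lambda_rho.
# y_t^{Lambda_rho} = argmin_{y} sum_x rho(x|x_<t) * loss[x][y]

def lambda_rho(rho, loss, actions):
    """rho: dict x -> probability; loss: dict (x, y) -> loss in [0, 1]."""
    def expected_loss(y):
        return sum(rho[x] * loss[(x, y)] for x in rho)
    return min(actions, key=expected_loss)

# Hypothetical asymmetric loss: predicting 0 when x=1 is worse than the converse.
loss = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 1.0, (1, 1): 0.0}
rho = {0: 0.7, 1: 0.3}
print(lambda_rho(rho, loss, actions=[0, 1]))   # picks 1 here: 0.7*0.4 < 0.3*1.0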

16 Unit loss bound:
0 ≤ L_n^{Λ_ξ} - L_n^{Λ_µ} ≤ D_n + √(4 L_n^{Λ_µ} D_n + D_n²) ≤ 2 D_n + 2√(L_n^{Λ_µ} D_n)
Corollary:
L_∞^{Λ_µ} finite ⟹ L_∞^{Λ_ξ} finite;  L_∞^{Λ_ξ} ≤ 2 D_∞ ≤ 2 ln w_µ^{-1} for deterministic µ if l_{x y_x} = 0;
L_n^{Λ_ξ}/L_n^{Λ_µ} = 1 + O((L_n^{Λ_µ})^{-1/2}) → 1  and  L_n^{Λ_ξ} - L_n^{Λ_µ} = O(√(L_n^{Λ_µ})).
Let Λ be any prediction scheme. Then L_n^{Λ_µ} ≤ L_n^Λ and l_t^{Λ_µ}(x_{<t}) ≤ l_t^Λ(x_{<t}), hence also
L_n^{Λ_ξ} - L_n^Λ ≤ 2 D_n + 2√(L_n^Λ D_n)  and  L_n^{Λ_ξ}/L_n^Λ ≤ 1 + O((L_n^Λ)^{-1/2}).

17 l_t^{Λ_ξ} - l_t^{Λ_µ} → 0 would follow from ξ → µ by continuity, but l_t^{Λ_ρ} is in general discontinuous in ρ. Fortunately, it is continuous at ρ = µ.
Instantaneous loss bound:
Σ_{t=1}^n E[(√(l_t^{Λ_ξ}(x_{<t})) - √(l_t^{Λ_µ}(x_{<t})))²] ≤ 2 D_n ≤ 2 ln w_µ^{-1} < ∞
0 ≤ l_t^{Λ_ξ}(x_{<t}) - l_t^{Λ_µ}(x_{<t}) ≤ Σ_{x_t} |ξ(x_t|x_{<t}) - µ(x_t|x_{<t})| ≤ √(2 d_t(x_{<t})) →_t 0 w.µ.p.1
0 ≤ l_t^{Λ_ξ}(x_{<t}) - l_t^{Λ_µ}(x_{<t}) ≤ 2 d_t(x_{<t}) + 2√(l_t^{Λ_µ}(x_{<t}) d_t(x_{<t})) →_t 0 w.µ.p.1
The loss function could depend on time and even on the individual history; it is enough that it is bounded: l_{x_t y_t}(x_{<t}) ∈ [l_min, l_max], l_Δ := l_max - l_min.
Global loss bound:
0 ≤ L_n^{Λ_ξ} - L_n^{Λ_µ} ≤ l_Δ D_n + √(4 (L_n^{Λ_µ} - n l_min) l_Δ D_n + l_Δ² D_n²)

18 Games of Chance
The profit per round is p_t^{Λ_ρ} = -l_{x_t y_t} (with |p_t| ≤ p_max), the total profit is P_n^{Λ_ρ} = -L_n^{Λ_ρ}, and the average per-round profit is p̄_n^{Λ_ξ} := (1/n) P_n^{Λ_ξ}.
Time to win: p̄_n^{Λ_ξ} = p̄_n^{Λ_µ} - O(n^{-1/2}) → p̄_n^{Λ_µ}; if p̄_n^{Λ_µ} > 0, then p̄_n^{Λ_ξ} > 0 for n > (2 k_µ / p̄_n^{Λ_µ})², where w_µ = e^{-k_µ}.
Information-theoretic interpretation: in the worst case, that many bits about µ have to be transferred, paid for out of the received profit. (Read the book.)

19 Optimality
The prior ξ is Pareto optimal w.r.t. s_t, S_n, d_t, D_n, e_t, E_n, l_t, L_n.
Balanced Pareto optimality w.r.t. L: with Δ_ν := L_ν^{Λ_ξ} - L_ν^{Λ_ν} ≥ 0 for ν ∈ M, no policy can decrease some Δ_ν without increasing the w_ν-weighted sum Σ_{ν∈M} w_ν Δ_ν; in particular Δ_η ≤ w_η^{-1} max_{λ∈M} Δ_λ.
We have derived bounds for the mean squared sum, S_{nν} ≤ ln w_ν^{-1}, and for the loss regret, L_{nν}^{Λ_ξ} - L_{nν}^{Λ_ν} ≤ 2 ln w_ν^{-1} + 2√(L_{nν}^{Λ_ν} ln w_ν^{-1}).
Optimality of universal weights: within the set of enumerable weight functions with short program, the universal weights w_ν = 2^{-K(ν)} lead to loss bounds within an additive constant of the smallest (in ln w_µ^{-1}) in all enumerable environments. It is difficult to prove that universal weights are optimal. See exercise 3.7 in the book.

20 For the maximum a posteriori approximator ρ(x) := max_ν {w_ν ν(x): ν ∈ M}, or equivalently the minimum description length estimator ρ(x) := ν^MDL(x) with ν^MDL := argmin_{ν∈M} {log₂ ν(x)^{-1} + log₂ w_ν^{-1}}:
Σ_{t=1}^∞ E[Σ_{x_t} (µ(x_t|x_{<t}) - ρ^{(norm)}(x_t|x_{<t}))²] ≤ w_µ^{-1},
where ρ(x_t|x_{<t}) := ρ(x_{1:t})/ρ(x_{<t}) and ρ^norm(x_t|x_{<t}) := ρ(x_{1:t})/Σ_{x_t} ρ(x_{1:t}).
These bounds are tight, thus MDL converges i.m.s., but the convergence speed can be exponentially worse than for ξ.
Multistep predictions: for horizon h ≥ 1, E[a_{t:n_t}]² ≤ 2 h ln w_µ^{-1}; for an arbitrary horizon one only gets convergence in the mean (slow).
Continuous Probability Classes: entropy bound
D_n := E[ln(µ(x_{1:n})/ξ(x_{1:n}))] ≤ ln w_µ^{-1} + (d/2) ln(n/2π) + (1/2) ln det J_n + o(1),
where M = {µ_θ: θ ∈ R^d}, J_n is the Fisher information matrix of the family of distributions, and continuity conditions hold (see the book).
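A sketch of the MAP/MDL estimator over a finite class, in both of the equivalent forms above (maximum of w_ν ν(x) versus minimum of the two-part code length); the Bernoulli class and its weights are made-up assumptions.

# Sketch: MAP approximator rho(x) = max_nu w_nu * nu(x), equivalently the MDL estimator
# argmin_nu [-log2 nu(x) - log2 w_nu], over a finite Bernoulli class (illustrative).
from math import log2

thetas = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}      # nu -> prior weight w_nu (assumption)

def nu_prob(theta, x):                          # nu(x) for a Bernoulli(theta) environment
    return theta ** x.count("1") * (1 - theta) ** x.count("0")

def rho_map(x):
    return max(w * nu_prob(th, x) for th, w in thetas.items())

def mdl_pick(x):
    return min(thetas, key=lambda th: -log2(nu_prob(th, x)) - log2(thetas[th]))

x = "110111"
print(rho_map(x), mdl_pick(x))                  # the argmax/argmin coincide by construction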

21 The Universal Algorithmic Agent AIXI
Agents in Known Probabilistic Environments
The agent model (deterministic case): p: X* → Y*, y_{1:k} = p(x_{<k}); q: Y* → X*, x_{1:k} = q(y_{1:k}); x_k = r_k o_k ∈ X = R × O.
Action y_k is determined by a policy p depending on the I/O history y_1 x_1 ... y_{k-1} x_{k-1} = yx_{<k}.
Future total reward the agent receives in cycles k to m: V_{km}^{pq} := Σ_{i=k}^m r(x_i^{pq}).

22 General environment case µ. The best agent maximizes the µ-expected utility (the value function):
V_{1m}^{pµ} := Σ_q µ(q) V_{1m}^{pq},  p* := argmax_p V_{1m}^{pµ};
moreover V_{km}^{p*q} ≥ V_{km}^{pq} for all p with y_{<k}^{pq} = y_{<k}^{p*q}.
The AIµ model in Functional Form
The AIµ model is the agent with the policy p^µ that maximizes the µ-expected total reward r_1 + ... + r_m, i.e. p^µ := argmax_p V_{1m}^{pµ}. In cycle k the (future) value V_{km}^{pµ}(ẏẋ_{<k}) of policy p is defined as the µ-expectation of the future reward sum r_k + ... + r_m.
Assume the true, or generating, history in cycle k is ẏẋ_{<k}. Let Q_k := {q: q(ẏ_{<k}) = ẋ_{<k}} be the set of all environments consistent with this history. Then:
V_{km}^{pµ}(ẏẋ_{<k}) := Σ_{q∈Q_k} µ(q) V_{km}^{pq} / Σ_{q∈Q_k} µ(q)

23 We generalize the finite lifetime m to a dynamic farsightedness h_k := m_k - k + 1 ≥ 1, called the horizon.
p_k := argmax_{p∈P_k} V_{km_k}^{pµ}(ẏẋ_{<k}),  where P_k := {p: ∃y_k: p(ẋ_{<k}) = ẏ_{<k} y_k} is the set of policies consistent with the current history.
By recursively inserting p_{k-1}, ..., p_1 we obtain the AIµ model policy ṗ with ṗ(ẋ_{<k}) := p_k(ẋ_{<k}).
For constant m we have V_{km}^µ(ẏẋ_{<k}) ≥ V_{km}^{pµ}(ẏẋ_{<k}) for all p ∈ P_k.
(In sequence prediction it was enough to maximize the next reward; here the sum of future rewards is important.)

24 The AIµ model in Recursive and Iterative Form
Notation: underlined arguments represent probabilistic variables and non-underlined variables represent conditions, e.g. ρ(x_{<n} x_n) = ρ(x_{1:n})/ρ(x_{<n}).
Chronological probability distributions: ρ(yx_{<k} y_k) := Σ_{x_k} ρ(yx_{1:k}).
Expected reward:
V_{km}^µ(yx_{<k} y_k) := Σ_{x_k} [r(x_k) + V_{k+1,m}^µ(yx_{1:k})] µ(yx_{<k} yx_k)
How p^µ chooses y_k: V_{km}^µ(yx_{<k}) := max_{y_k} V_{km}^µ(yx_{<k} y_k). Together with the induction start V_{m+1,m}^µ(yx_{1:m}) := 0, V_{km}^µ is completely defined:
V_{km}^µ(yx_{<k}) = max_{y_k} Σ_{x_k} [r(x_k) + V_{k+1,m}^µ(yx_{1:k})] µ(yx_{<k} yx_k)
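The recursive definition of V^µ_{km} is just expectimax over actions and probability-weighted perceptions. A minimal sketch for a toy chronological environment given as conditional probability tables; the environment itself is a made-up assumption of the illustration.

# Sketch: expectimax recursion V_km(h) = max_y sum_x [r(x) + V_{k+1,m}(h + yx)] mu(x | h, y),
# with induction start V_{m+1,m} := 0. The toy environment env() is an assumption.

ACTIONS = ["a", "b"]
PERCEPTS = [(0.0, "o0"), (1.0, "o1")]            # (reward, observation) pairs

def env(history, y, x):
    """mu(x | history, y): toy dynamics that favour reward after action 'a'."""
    reward, _ = x
    p_good = 0.8 if y == "a" else 0.3
    return p_good if reward == 1.0 else 1 - p_good

def value(history, k, m):
    if k > m:                                     # induction start V_{m+1,m} := 0
        return 0.0
    best = float("-inf")
    for y in ACTIONS:
        v = sum(env(history, y, x) * (x[0] + value(history + [(y, x)], k + 1, m))
                for x in PERCEPTS)
        best = max(best, v)
    return best

print(value([], 1, 4))                            # optimal expected total reward, horizon m = 4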

25 If m_k is the horizon function of k and ẏẋ_{<k} is the actual history in cycle k, then
ẏ_k = argmax_{y_k} V_{km_k}^µ(ẏẋ_{<k} y_k)
Unfolding the recursion:
ẏ_k := argmax_{y_k} Σ_{x_k} max_{y_{k+1}} Σ_{x_{k+1}} ... max_{y_{m_k}} Σ_{x_{m_k}} (r(x_k) + ... + r(x_{m_k})) µ(ẏẋ_{<k} yx_{k:m_k})
The value of a general policy p:
V_{km}^{pµ}(yx_{<k}) := Σ_{x_{k:m}} (r_k + ... + r_m) µ(yx_{<k} yx_{k:m})  with y_{1:m} = p(x_{<m})
Equivalence of the Functional and Explicit AI Model:
µ(yx_{1:k}) = Σ_{q: q(y_{1:k}) = x_{1:k}} µ(q)

26 Factorizable environments µ. Assume that the cycles are grouped into independent episodes:
µ(yx_{1:n}) = Π_{r=0}^{s-1} µ_r(yx_{n_r+1:n_{r+1}})
Then ẏ_k depends on µ_r and on the x and y of episode r only:
ẏ_k = argmax_{y_k} Σ_{x_k} ... max_{y_t} Σ_{x_t} (r(x_k) + ... + r(x_t)) µ_r(ẏẋ_{n_r+1:k-1} yx_{k:t})
with t := min{m_k, n_{r+1}}.

27 Probabilistic policies. For a policy π:
V^{πµ} = Σ_{yx_{1:m}} (r_1 + ... + r_m) µ(x_m|yx_{<m} y_m) π(y_m|yx_{<m}) ... µ(x_1|y_1) π(y_1)
 = Σ_{yx_{1:m}} (r_1 + ... + r_m) µ(x_{1:m}|y_{1:m}) π(y_{1:m}|x_{<m})
 = Σ_{yx_{1:m}} (r_1 + ... + r_m) ρ(yx_{1:m})
Among the optimal policies there is always a deterministic one: V^µ = max_π V^{πµ} = max_p V^{pµ}

28 Persistent Turing Machines (reference: PTMs)
- an independently introduced model of interactive computation, based on nondeterministic monotone Turing machines (three tapes: read-only, work, write-only)
- cuts the environment out of the loop: inputs are arbitrary
- based on coinductive notions (coalgebras, LTSs, bisimulation)
- stresses infinite input/output alphabets (e.g. strings)
PTM operation: (diagram not reproduced in this transcription)

29 The Universal Algorithmic Agent AIXI
Replace the unknown prior probability µ^AI in the AIµ model by a universal semi-probability M^AI with M(q) := 2^{-l(q)}:
M(yx_{1:k}) = Σ_{q: q(y_{1:k}) = x_{1:k}} 2^{-l(q)}
The equivalence of the functional and iterative models still holds; equivalence with the recursive AI model holds after normalization of M (which will then no longer be enumerable, but the universal value V_{km}^{pξ} will still be enumerable). q knows the length of the output from the input length.
Summing over all enumerable chronological semimeasures:
ξ(yx_{1:n}) := Σ_ρ 2^{-K(ρ)} ρ(yx_{1:n}),
Σ_{k=1}^n Σ_{x_{1:k}} µ(yx_{<k}) (µ(yx_{<k} yx_k) - ξ(yx_{<k} yx_k))² ≤ ln2 · K(µ)
Just like in the sequence prediction case, ξ(yx_{<k} yx_{k:m_k}) → µ(yx_{<k} yx_{k:m_k}) for k → ∞: i.m.s. if h_k = m_k - k + 1 ≤ h_max < ∞, i.m. for general m_k.

30 Intelligence Order Relation. Extend the ξ-expected reward definition to programs p that are not consistent with the current history:
V_{km}^{pξ}(ẏẋ_{<k}) := (1/N) Σ_{q: q(ẏ_{<k}) = ẋ_{<k}} 2^{-l(q)} V_{km}^{p̃q}
where N is the normalization factor (only necessary for the expectation interpretation) and, for p ∉ P_k, p̃ is p modified to output ẏ_{<k} on the current history and left unaltered for further cycles.
p is more or equally intelligent than p', written p ⪰ p', iff ∀k ∀ẏẋ_{<k}: V_{km_k}^{pξ}(ẏẋ_{<k}) ≥ V_{km_k}^{p'ξ}(ẏẋ_{<k}).
For completely unknown µ we could take ξ = M and treat AIXI as optimal by construction (similarly to taking a uniform prior over parameters in the bandit problem).

31 Separability concepts.
Self-optimizing policies: a policy p_best independent of µ with (1/m) V_{1m}^{p_best µ} →_m (1/m) V_{1m}^µ, or alternatively V_{1m}^{p_best µ} ≥ V_{1m}^{pµ} - o(m) for all µ and p. The HeavenHell example shows that self-optimizing policies need not exist.
The OnlyOne example: M := {µ_{y'}: y' ∈ Y, K(y') = ⌊log₂|Y|⌋} with µ_{y'}(yx_{<k} y_k 1) := δ_{y_k y'}; there are N = |Y| such y'. The number of errors is E^p ≥ N - 1 = |Y| - 1 = 2^{K(y')} - 1 ≈ 2^{K(µ)}. 2^{K(µ)} is the best possible bound depending on K(µ); it could be OK if K(µ|ẋ_{<k}) = O(1).
µ is passive if the environment is not influenced by the agent's output. M and µ ∈ M are pseudo-passive if the corresponding p_best = p^ξ is self-optimizing.

32 The µ-expected number of suboptimal choices:
D_{nµξ} := E[Σ_{k=1}^n (1 - δ_{ẏ_k^µ, ẏ_k^ξ})],  where ẏ_k^µ = p^µ(ẏẋ_{<k})
µ can be asymptotically learned if D_{nµξ}/n → 0, i.e. D_{nµξ} = o(n). Claim: AIXI can asymptotically learn any relevant problem.
µ is uniform if the deviations µ(yx_{<k} y_k x_k) - ξ(yx_{<k} y_k x_k) are within a constant factor c of each other for all y_k, x_k. There are relevant µ that are not uniform. Uniform µ can be asymptotically learned for appropriately weighted D_{nµξ} and bounded horizon.
µ is forgetful if µ(yx_{<k} yx_k) becomes independent of yx_{<l} for fixed l and k → ∞. µ is farsighted if lim_{m_k→∞} ẏ_k^{(m_k)} exists. Further separability concepts: Markovian, generalized (l-th order) Markovian, ergodic, factorizable.

33 Value-Related Optimality, Discounted Future Value Function
The γ-discounted weighted-average future value of a probabilistic policy π in environment ρ given history yx_{<k} (the ρ-value of π given yx_{<k}):
V_{kγ}^{πρ}(yx_{<k}) := lim_{m→∞} (1/Γ_k) Σ_{yx_{k:m}} (γ_k r_k + ... + γ_m r_m) ρ(yx_{<k} yx_{k:m}) π(yx_{<k} yx_{k:m}),  with Γ_k := Σ_{i=k}^∞ γ_i.
The discounted AIρ model: p^ρ is defined as the policy
p^ρ := argmax_π V_{kγ}^{πρ},  V_{kγ}^ρ := V_{kγ}^{p^ρ ρ} = max_π V_{kγ}^{πρ} ≥ V_{kγ}^{πρ} for all π
Linearity and convexity of V^ρ in ρ:
V_{kγ}^{πξ} = Σ_{ν∈M} w_ν^k V_{kγ}^{πν}  and  V_{kγ}^ξ ≤ Σ_{ν∈M} w_ν^k V_{kγ}^ν,
where ξ(yx_{<k} yx_{k:m}) = Σ_{ν∈M} w_ν^k ν(yx_{<k} yx_{k:m}) with w_ν^k := w_ν ν(yx_{<k})/ξ(yx_{<k}).
Pareto optimality: there is no other policy π with V_{kγ}^{πν} ≥ V_{kγ}^{p^ξ ν} for all ν ∈ M and strict inequality for at least one ν.

34 Balanced Pareto optimality: with Δ_ν^k := V_{kγ}^ν - V_{kγ}^{πν} ≥ 0 and Δ^k := Σ_{ν∈M} w_ν^k Δ_ν^k,
0 ≤ V_{kγ}^ν - V_{kγ}^{p^ξ ν} ≤ Δ^k / w_ν^k,
where all quantities depend on the history yx_{<k}.
If there exists a sequence of self-optimizing policies π_k, then the universal policy p^ξ is self-optimizing:
∃π_k ∀ν: V_{kγ}^{π_k ν} →_k V_{kγ}^ν w.ν.p.1  ⟹  V_{kγ}^{p^ξ µ} →_k V_{kγ}^µ w.µ.p.1,
where the probabilities are conditional on the historic perceptions x_{<k}.

35 The values V_{kγ}^{πµ} and V_{kγ}^µ are continuous in µ, and V_{kγ}^{p^µ̂ µ} is continuous in µ̂ at µ̂ = µ:
If Σ_{x_k} |µ̂(yx_{<k} y_k x_k) - µ(yx_{<k} y_k x_k)| ≤ ε for all y_k, all k ≥ k_0 and all yx_{<k}, then
i. |V_{kγ}^{πµ} - V_{kγ}^{πµ̂}| ≤ δ(ε)
ii. |V_{kγ}^µ - V_{kγ}^µ̂| ≤ δ(ε)
iii. |V_{kγ}^{p^µ̂ µ} - V_{kγ}^µ| ≤ 2δ(ε)
for all k ≥ k_0 and yx_{<k}, where δ(ε) = r_max · min_{n≥k} {(n-k)ε + Γ_n/Γ_k} →_{ε→0} 0.

36 If y_{1:m} = p(x_{1:m}) (on-policy) and V_k = V_k(yx_{<k}), then the universal undiscounted future value V_{km_k}^{pξ} with bounded dynamic horizon h_k = m_k - k + 1 converges i.m.s. to the true future value V_{km_k}^{pµ}, and the discounted future value V_{kγ}^{pξ} converges i.m. to V_{kγ}^{pµ} for any summable discount sequence γ_k.
i. |V_{km}^{pξ} - V_{km}^{pµ}| ≤ (m - k + 1) r_max a_{k:m},  |V_{kγ}^{pξ} - V_{kγ}^{pµ}| ≤ r_max √(2 d_{k:∞})
ii. Σ_{k=1}^∞ E[(V_{km_k}^{pξ} - V_{km_k}^{pµ})²] ≤ 2 h_max³ r_max² D,  E[(V_{kγ}^{pξ} - V_{kγ}^{pµ})²] ≤ 2 r_max² (D - D_{k-1}) →_k 0
iii. V_{km_k}^{pξ} →_k V_{km_k}^{pµ} i.m.s. if h_max < ∞;  V_{kγ}^{pξ} →_k V_{kγ}^{pµ} i.m. for any γ
where a_{k:m} := Σ_{x_{k:m}} |µ(yx_{<k} yx_{k:m}) - ξ(yx_{<k} yx_{k:m})|, d_{k:m} := Σ_{x_{k:m}} µ(yx_{<k} yx_{k:m}) ln(µ(yx_{<k} yx_{k:m})/ξ(yx_{<k} yx_{k:m})), and D_k := d_{1:k} ≤ D := d_{1:∞} ≤ ln w_µ^{-1} < ∞.

37 If M is a countable class of ergodic MDPs and ξ := Σ_{ν∈M} w_ν ν, then p^ξ maximizing V_{1m}^{pξ} and p^ξ maximizing V_{kγ}^{πξ} are self-optimizing:
∀ν ∈ M: (1/m) V_{1m}^{p_m^ξ ν} →_m (1/m) V_{1m}^ν,  and  V_{kγ}^{p^ξ ν} →_k V_{kγ}^ν for any history yx_{<k} if γ_{k+1}/γ_k → 1.
(A Markov decision process is ergodic if there exists a policy which visits each state infinitely often with probability 1.)
This rests on the existence of self-optimizing policies for the class of ergodic MDPs:
∃p_m ∀ν ∈ M_MDP1: (1/m) V_{1m}^ν - (1/m) V_{1m}^{p_m ν} = O(m^{-1/3}),
and, with unbounded effective horizon h_k^eff (γ_{k+1}/γ_k → 1), ∃π_k ∀ν ∈ M_MDP1: V_{kγ}^{π_k ν} →_k V_{kγ}^ν.
If M is finite, then the speed of the first convergence is at least O(m^{-1/3}). Ergodic POMDPs, ergodic l-th order MDPs and factorizable environments also allow self-optimizing policies.

38 Choice of the Horizon.
A fixed (effective) horizon is OK if we know the lifetime of the agent; e.g. if the probability of surviving to the next cycle is (always, independently) γ < 1, we can choose geometric discounting with rate γ.
General discounting introduces unbounded effective horizons. Let r_k → γ_k r_k with γ_k > 0 and r_k ∈ [0,1]. If Γ_k := Σ_{i=k}^∞ γ_i < ∞, then V_{kγ}^{pρ} := (1/Γ_k) lim_{m→∞} V_{km}^{pρ} exists. The β-effective horizon is h_k^β := min{h ≥ 0: Γ_{k+h} ≤ β Γ_k}; approximating V_{kγ} by the first h_k^β terms introduces an error of at most β r_max. Let h_k^eff := h_k^{β=1/2}.
Horizons (γ_k | Γ_k = Σ_{i≥k} γ_i | h_k^β):
finite: γ_k = 1 for k ≤ m, 0 for k > m | m - k + 1 | ≈ (1-β)(m-k+1)
geometric: γ_k = γ^k, 0 ≤ γ < 1 | γ^k/(1-γ) | ≈ ln β / ln γ
power: γ_k = k^{-1-ε}, ε > 0 | ≈ (1/ε) k^{-ε} | ≈ (β^{-1/ε} - 1) k
harmonic-like: γ_k = 1/(k ln² k) | ≈ 1/ln k | ≈ k^{1/β} - k
universal: γ_k = 2^{-K(k)} | decreases slower than any computable function | increases faster than any computable function
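The β-effective horizon is easy to compute numerically for a given discount sequence; the sketch below does this for geometric and power discounts and roughly reproduces the orders of magnitude in the table (truncating the infinite tails is an approximation of this sketch).

# Sketch: beta-effective horizon h_k^beta = min{h >= 0 : Gamma_{k+h} <= beta * Gamma_k},
# computed from truncated tails Gamma_k = sum_{i>=k} gamma_i (truncation is an approximation).

def effective_horizon(gamma, k, beta=0.5, n_max=200000):
    tail = [0.0] * (n_max + 2)
    for i in range(n_max, k - 1, -1):             # Gamma_i for i = n_max ... k
        tail[i] = gamma(i) + tail[i + 1]
    h = 0
    while tail[k + h] > beta * tail[k]:
        h += 1
    return h

geometric = lambda i, g=0.95: g ** i              # Gamma_k = g^k/(1-g), h ~ ln(beta)/ln(g) ~ 13.5
power = lambda i, eps=1.0: i ** (-1.0 - eps)      # h ~ (beta^{-1/eps} - 1) * k ~ k
print(effective_horizon(geometric, k=10), effective_horizon(power, k=10))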

39 Infinite horizon: take ẏ_k ∈ lim inf_{m→∞} Y_k(m), where Y_k(m) := {ẏ_k^{(m')}: m' ≥ m}. The limit lim_{m→∞} V_{km}(yx_{<k}) need not exist. But immortal agents constructed this way are lazy: if postponing a reward makes it bigger, the agent will not get any reward.
For ρ ∈ M let ξ' := (1-α)ξ + αρ; then
sup_p lim_k [V_{kγ}^µ - V_{kγ}^{p^{ξ'}µ}] ≤ α r_max / ((1-α) w_µ),
and there are examples where equality holds. Thus a belief contamination of magnitude α comparable to w_µ can completely degenerate performance. (???)
(Posterization) It is not true that if w_ν = 2^{-K(ν)}, then w_ν^k ≈ 2^{-K(ν|yx_{<k})}, where Σ_{ν∈M} w_ν^k ν(·|yx_{<k}) := ξ(·|yx_{<k}).

40 Actions as random variables
Instead of defining ξ as a mixture of environments, we could use a universal distribution over perceptions and actions and then conditionalize on the actions:
ξ_alt^AI(yx_{1:n}) := M(yx_{1:n}) / Σ_{x_{1:n}} M(yx_{1:n})
where M is Solomonoff's prior (we could use ξ_U as well). Open problems: is ξ_alt^AI enumerable? Is ξ_alt^AI = ξ^AI? Could M(yx_{<k} ȳ_k) be close to the action of p^ξ and/or p^{ξ_alt} for large k, justifying the interpretation that M(yx_{<k} ȳ_k) is the agent's own belief in selecting action y_k?

41 Uniform mixture of MDPs
Let µ_T ∈ M_MDP be a completely observable MDP with transition matrix T: µ_T(as_{1:n}) = T_{s_0 s_1}^{a_1} ... T_{s_{n-1} s_n}^{a_n}. Reward is a function of the state: r_k = r(s_k). The Bayes mixture:
ξ(as_{1:n}) := ∫_T w_T µ_T(as_{1:n}) dT
For a uniform prior belief,
ξ(as_{<n} a_n s_n) = ξ(as_{1:n})/ξ(as_{<n}) = (N_{s_{n-1} s_n}^{a_n} + 1) / (Σ_{s'} N_{s_{n-1} s'}^{a_n} + S)
where S = |X| is the number of states and N_{ss'}^a is the historical (i.e. in as_{1:n}) number of transitions from s to s' under a.
Although T is continuous and contains non-ergodic environments, the Bayes-optimal policy p^ξ is self-optimizing for ergodic environments µ_T ∈ M_MDP1. (Intuition: the space of T is compact and non-ergodic environments have measure zero.)
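For the uniform prior over transition matrices, the Bayes-mixture predictive probability reduces to a Laplace-style count estimator, as in the formula above. A minimal sketch (the trajectory is a made-up example):

# Sketch: Bayes-mixture predictive transition probability under a uniform prior over T:
# xi(s_n | as_<n, a_n) = (N^{a_n}_{s_{n-1} s_n} + 1) / (sum_{s'} N^{a_n}_{s_{n-1} s'} + S)
from collections import defaultdict

S = 3                                              # number of states (assumption)
N = defaultdict(int)                               # N[(a, s, s')] = transition counts

def update(a, s, s_next):
    N[(a, s, s_next)] += 1

def predictive(a, s, s_next):
    row = sum(N[(a, s, sp)] for sp in range(S))
    return (N[(a, s, s_next)] + 1) / (row + S)

# made-up trajectory: (action, state, next state)
for a, s, sn in [(0, 0, 1), (0, 0, 1), (0, 0, 2), (1, 1, 0)]:
    update(a, s, sn)
print(predictive(0, 0, 1))                         # (2+1)/(3+3) = 0.5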

42 Posterior belief w_T(as_{1:n}) ∝ w_T µ_T(as_{1:n}) is a (complex) distribution over possible T. Most RL algorithms only estimate a single T (e.g. a most likely or an expected T). Policy p^ξ appropriately explores the environment, while popular policies based on E[T] or on Maximum Likelihood lack exploration. Expected transition probability:
E[T_{ss'}^a | as_{1:n}] = ∫ T_{ss'}^a w_T(as_{1:n}) dT = (N_{ss'}^a + 1) / (Σ_{s''} N_{ss''}^a + S)

43 Important Environmental Classes
Here ξ = ξ_U = M is Solomonoff's prior, i.e. AIξ = AIXI.
Sequence Prediction. The AIµ model and SPµ = Θ_µ (derived for a binary alphabet B) are equivalent for known µ, and the expected prediction error relates to the value function: V_{1m}^µ = m - E_m^{Θ_µ}.
The general ξ is not symmetric in y_i r_i ↔ (1-y_i)(1-r_i) and is thus more difficult. We concentrate on a deterministic computable environment ż = ż_1 ż_2 ... with Km(ż_1...ż_n) ≤ Km(ż) < ∞ and horizon m_k = k (greedily maximize the next reward; this is sufficient for SP but does not show the behavior of AIXI for a universal horizon). We have (best proven bound):
E_∞^{AIξ} < 1/α = 2^{Km(ż)+O(1)}
The intuitive interpretation is that each wrong prediction eliminates at least one program p of size l(p) ≤⁺ Km(ż). The best possible bound would be:
E_∞^{SPξ} ≤⁺ 2 ln2 · Km(ż)

44 Strategic Games
We restrict ourselves to deterministic strictly competitive strategic games. Assume a bounded-length game padded to length n, and assume the environment uses the minimax strategy:
ȯ_k = argmin_{o_k} max_{y_{k+1}} min_{o_{k+1}} ... max_{y_n} min_{o_n} V(ẏ_1ȯ_1 ... ẏ_k o_k y_{k+1} o_{k+1} ... y_n o_n)
r_1 = ... = r_{n-1} = 0; r_n = 1 if the AIµ agent wins, 1/2 for a draw, and 0 if the environment wins. An illegal move is an instant loss.
ẏ_k^AI = argmax_{y_k} Σ_{o_k} ... max_{y_n} Σ_{o_n} V(ẏȯ_{<k} yo_{k:n}) µ^SG(ẏȯ_{<k} yō_{k:n})
 = argmax_{y_k} Σ_{o_k} ... max_{y_n} min_{o_n} V(ẏȯ_{<k} yo_{k:n}) µ^SG(ẏȯ_{<k} yō_{k:n-1})
 = argmax_{y_k} min_{o_k} max_{y_{k+1}} min_{o_{k+1}} ... max_{y_n} min_{o_n} V(ẏȯ_{<k} yo_{k:n}) = ẏ_k^SG
If the game is played multiple times, then µ is factorizable. But if the game has variable length, then µ is no longer factorizable: a better player prefers short games and prefers a quick draw over too long a win.
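The minimax recursion used by the environment (and recovered by AIµ above) is standard game-tree search. A sketch on a tiny abstract game tree; the tree itself is a made-up placeholder for a real game.

# Sketch: minimax value of a strictly competitive game, V from the agent's point of view.
# Leaves hold the final reward r_n in {0, 1/2, 1}; the tree is an illustrative placeholder.

def minimax(node, agent_to_move):
    if not isinstance(node, list):                 # leaf: r_n
        return node
    values = [minimax(child, not agent_to_move) for child in node]
    return max(values) if agent_to_move else min(values)

# depth-2 toy game: agent moves first (max), environment answers (min)
game = [[1.0, 0.5],      # after agent move y1: environment can hold the agent to 0.5
        [0.0, 1.0]]      # after agent move y2: environment can force 0.0
print(minimax(game, agent_to_move=True))           # = 0.5, the game-theoretic value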

45 Using the AIξ model for game playing. The AIξ agent obtains much less than log₂ 3 bits of information from a single game, so it needs at least a number O(K(game)) of games. Variable-length games are better: the AIξ agent will quickly learn legal moves from short initial games. Next, AIξ will learn the losing positions. Then, AIξ will win some games by luck or will exploit the game's symmetry to learn winning positions.
The AIξ agent can take advantage of environmental players with limited rationality by not settling on the minimax strategy.

46 Function Minimization
We consider distributions over functions:
µ_FM(y_1 z_1 ... y_n z_n) := Σ_{f: f(y_i)=z_i, 1≤i≤n} µ(f)
A greedy model is not appropriate, because it will stick with an argument whose value is already below the expectations for other values. For episodes of length m:
ẏ_k = argmin_{y_k} Σ_{z_k} ... min_{y_m} Σ_{z_m} (α_1 z_1 + ... + α_m z_m) µ(ẏ_1ż_1 ... ẏ_{k-1}ż_{k-1} y_k z_k ... y_m z_m)
If we want the last output to be optimal, set α_1 = ... = α_{m-1} = 0, α_m = 1 (FMFξ); if we want already good approximations along the way, set α_1 = ... = α_m = 1 (FMSξ); etc.
For the FMξ model:
ξ_FM(y_1 z_1 ... y_n z_n) := Σ_{q: q(y_i)=z_i, 1≤i≤n} 2^{-l(q)}
FMξ will never cease searching for minima and will test an infinite set of y's for m → ∞. FMFξ will never repeat any y except at t = m. FMSξ (for fixed m) will test a new y_t only if the expected f(y_t) is not too large.

47 For AIµ/ξ we need r_k = -α_k z_k and o_k = z_k.
AIξ has a problem with the FMF model: it must first learn that it has to minimize a function. It can learn this by repeated minimization of (different) functions.

48 Supervised Learning from Examples (EX)
The environment presents inputs o_{k-1} = z_k v_k ≡ (z_k, v_k) ∈ Z × (Y ∪ {?}) ⊆ O, where x_i = r_i o_i ∈ X = R × O. The relations R ⊆ Z × Y might be distributed with probability σ(R):
µ^AI(y_1 x_1 ... y_n x_n) = Σ_{R: r(z_i, y_i) = r_i for 1 < i ≤ n} σ(R) µ_R(o_1 ... o_n)
with o_{i-1} = z_i v_i and v_i ∈ Y ∪ {?}.
The AIξ agent only needs O(1) bits from the reinforcement r_k to learn to extract z_i from o_{i-1} and return y_i with (z_i, y_i) ∈ R.

49 AIXItl and Optimal Search
The Fastest and Shortest Algorithm for All Problems
Let p': a given algorithm or a specification of a function; p: any program provably equivalent to p' with time complexity provably bounded by t_p; time_{t_p}(x): the time needed to compute t_p(x).
For fixed ε ∈ (0, 1/2), the algorithm M_{p'}^ε computes p'(x) in time
time_{M_{p'}^ε}(x) ≤ (1+ε) t_p(x) + d_p^ε · time_{t_p}(x) + c_p^ε
with constants c_p^ε and d_p^ε independent of x. If time_{t_p}(x) ≈ t_p(x), the effective multiplicative constant is 1 + ε + d_p; it is huge if no good, quickly computable time bound exists. This could be the case for many complex approximation problems and for universal reinforcement learning.
For example, if a matrix multiplication algorithm p with time t_p(n) = d·n² log n exists, then M_p^ε also multiplies matrices in time (1+ε)·d·n² log n up to lower-order (n^{2+o(1)}) terms, for all inputs.
Blum's speed-up theorem does not affect this, because its speed-up sequence is not computable.

50 Levin Search
Inverts a quickly computable function g by devoting a 2^{-l(p)} portion of the total time to each potential inverse-algorithm p. Computation time: about 2^{l(p)} · (time_p(x) + time for checking g(p(x)) = x).
It can be implemented as Li & Vitanyi's SEARCH(g): run every p shorter than i for 2^i 2^{-l(p)} steps in phase i = 1, 2, 3, ... until g is inverted on x.
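A minimal sketch of the SEARCH(g)-style schedule: in phase i, every candidate program p with l(p) < i gets 2^{i-l(p)} steps, until some p inverts g on x. Programs are bit strings interpreted by a toy, hypothetical interpreter; the scheduling is the point, not the program encoding.

# Sketch: Levin-search / SEARCH(g) schedule. In phase i = 1, 2, 3, ... run each program p
# with l(p) < i for 2^(i - l(p)) steps until g(p(x)) == x. The toy interpreter is an assumption.
from itertools import product

def run(program_bits, steps):
    """Hypothetical interpreter: a program encodes an integer candidate inverse; one 'step'
    suffices to evaluate it. Returns the candidate if it finished within the step budget."""
    if steps < 1:
        return None
    return int("0" + "".join(map(str, program_bits)), 2)

def levin_search(g, x, max_phase=20):
    for i in range(1, max_phase + 1):              # phase i
        for l in range(1, i):                      # all programs shorter than i
            for p in product([0, 1], repeat=l):
                candidate = run(p, steps=2 ** (i - l))
                if candidate is not None and g(candidate) == x:
                    return candidate, p, i
    return None

g = lambda y: y * y                                # quickly computable function to invert
print(levin_search(g, 49))                         # finds y = 7, encoded by a short program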

51 The Fast Algorithm M_p^ε
Algorithm M_p^ε(x):
Initialize the shared variables L := {}, t_fast := ∞, p_fast := p.
Start algorithms A, B, C in parallel with relative computational resources ε, ε, 1-2ε respectively.
A. Systematically search the space of proofs for proofs of formulas of the form
   ∀y: u(?p, y) = u(p, y) ∧ u(?t, y) ≥ time(?p, y).
   For each such proof, add the answer substitution (p', t') to L.
B. For all (p', t') ∈ L: run U on (t', x) in parallel for all t' with relative computational resources 2^{-l(p')-l(t')}. If U halts for some t' and U(t', x) < t_fast, then set t_fast := U(t', x) and p_fast := p' and restart algorithm C.
C. Run U on (p_fast, x). For each executed step decrease t_fast by 1. If U halts, then abort the computations of A and B and return U(p_fast, x).

52 Time-Bounded AIXI Model
Time-Limited Probability Distributions. We could limit the environments in the universal mixture ξ to those of length ≤ l̃ computable in time ≤ t̃, arriving at an expectimax algorithm with complexity t(ẏ_k^{AIξ t̃ l̃}) = O(|Y|^{h_k} |X|^{h_k} 2^{l̃} t̃). (Considered poor, i.e. unintelligent.)
Best Vote Algorithm. Without normalization, the ξ-expected future reward is enumerable:
V_{km}^{pξ}(ẏẋ_{<k}) := Σ_{q∈Q_k} 2^{-l(q)} V_{km}^{pq},  V_{km}^{pq} := r(x_k^{pq}) + ... + r(x_m^{pq})
At every cycle we select the best (possibly inconsistent with history) policy. But ẏ_k^{AIξ} is uncomputable, so let the policy estimate V_{km_k} itself (approximable, the same effort as computing it): p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p.
No policy is allowed to claim to be better than it is (valid approximation):
VA(p) ≡ [∀k ∀w_1^p y_1^p ẏ_1ẋ_1 ... w_k^p y_k^p: p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p ⟹ w_k^p ≤ V_{km_k}^{pξ}(ẏẋ_{<k})]
V_{km_k}^ξ is enumerable: there is a sequence of policies (p_i) with VA(p_i) and lim_i w_k^{p_i} = V_{km_k}^ξ. The convergence is not uniform in k, but that is OK: we select a policy in each step.

53 The Universal Time-Bounded AIXItl Agent.
1. Systematically search the space of proofs shorter than l_P for proofs of VA(?p) and collect all answer substitutions p.
2. Eliminate all p of length > l̃.
3. Modify each p to stop within t̃ time steps per cycle (aborting with w_k = 0 if needed).
4. Start the first cycle: k := 1.
5. Run every p(ẏẋ_{<k}) = w_1^p y_1^p ... w_k^p y_k^p, where all outputs are redirected to some auxiliary tape, incrementally: add ẏẋ_{k-1} to the input tape and continue the computation of the previous cycle.
6. Select p_k := argmax_p w_k^p.
7. Write ẏ_k := y_k^{p_k} to the output tape.
8. Receive input ẋ_k from the environment.
9. Begin the next cycle: k := k + 1, goto step 5.
[Compare with e.g. accuracy-based learning classifier systems (XCS) by Wilson, or market/economy RL by Baum & Durdanovic.]

54 Optimality of AIXItl. Let p be any extended chronological (incremental) program (like above) of length l(p) ≤ l̃ and computation time per cycle t(p) ≤ t̃, for which there exists a proof of VA(p) of length ≤ l_P. We call p' effectively more or equally intelligent than p, written p' ⪰^c p, iff
∀k ∀ẏẋ_{<k} ∀w_{1:k} ∃w'_{1:k}: p(ẏẋ_{<k}) = w_1 ... w_k ⟹ p'(ẏẋ_{<k}) = w'_1 ... w'_k with w'_k ≥ w_k.
Then the AIXItl policy p' satisfies p' ⪰^c p. The length of p' is l(p') = O(log(l̃ · t̃ · l_P)), the setup time is (at most, depending on the proof search technique) t_setup(p') = O(l_P² 2^{l_P}), and the computation time per cycle is t_cycle(p') = O(2^{l̃} t̃). (To go faster, we could eliminate provably poor policies: how many Pareto-good policies are there?)
There could be policies which produce good outputs within reasonable time, but whose justification w^p or proof of VA(p) takes unreasonably long.
The inconsistent programs must be able to continue strategies started by other policies. A policy can steer the environment in a direction for which it is specialized. This requires enough separability to recover.

55 Since AIXI is incomputable but assumes computable environments, it cannot gamble with other AIXIs. Are there interesting environmental classes for which AIξ ∈ M or AIξtl ∈ M?
[Table comparing Algorithm/Properties (time efficient, data efficient, exploration, convergence, global optimum, POMDP, learning, active, generalization) for: Value/Policy iteration with finite S, TD with linear function approximation, TD with general function approximation, Direct Policy Search, Planners, Logic with Split Trees, RL with Expert Advice, Prediction with Levin Search, Adaptive Levin Search, OOPS RL, Market/Economy, SPXI, AIXI, AIXItl, Human. The individual yes/no entries are garbled in this transcription and are not reproduced.]
AIXI can incorporate prior knowledge D: just prepend it in any encoding; this decreases K(µ) to K(µ|D).

56 Speed Prior and OOPS (reference: TheNewAI)
Speed Prior
Assumption: the environment is deterministic.
Postulate: the cumulative prior probability measure of all x incomputable within time t by any method is at most inversely proportional to t.
Algorithm: Set t := 1. Start a universal TM with empty input tape. Repeat:
- While the number of instructions executed so far exceeds t: toss an unbiased coin; heads up: set t := 2t; otherwise exit.
- If the current input cell contains a symbol, execute it; otherwise set the cell's symbol randomly and set t := t/2.
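The sampling procedure above can be written as a small control loop around an abstract machine; the machine interface below (step(), wants_input(), provide_input()) is a hypothetical stand-in for a real universal monotone TM, so this is only a schematic of the time-doubling and halving bookkeeping.

# Sketch of the speed-prior sampling loop. The Machine class is a hypothetical stand-in
# for a universal monotone TM; only the t-doubling/halving control flow follows the slide.
import random

class Machine:                                     # toy machine: assumption of this sketch
    def __init__(self):
        self.executed, self.tape, self.pos, self.output = 0, [], 0, []
    def wants_input(self):
        return self.pos >= len(self.tape)          # next input cell still empty?
    def provide_input(self, bit):
        self.tape.append(bit)
    def step(self):                                # execute one instruction (dummy dynamics)
        self.output.append(self.tape[self.pos]); self.pos += 1; self.executed += 1

def sample_speed_prior(max_instructions=1000):
    t, m = 1.0, Machine()
    while m.executed < max_instructions:
        while m.executed > t:                      # out of budget: double t on heads, else exit
            if random.random() < 0.5:
                t *= 2
            else:
                return m.output
        if m.wants_input():                        # empty cell: set it randomly, halve t
            m.provide_input(random.randint(0, 1)); t /= 2
        else:
            m.step()
    return m.output

random.seed(0)
print(sample_speed_prior()[:20])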

57 Optimal Ordered Problem Solver
A searcher is n-bias-optimal (n ≥ 1) if, for any maximal total search time T_max > 0, it is guaranteed to solve any problem r ∈ R that has a solution p ∈ C which can be created and tested in time t(p, r) ≤ P(p|r) T_max / n (P is the task-specific bias).
Basic ingredients of OOPS:
Primitives. Interruptible low- or high-level instruction tokens (e.g. theorem provers, matrix operators for neural nets, etc.).
Task-specific prefix codes. Token sequences / program prefixes in a domain-specific language. Instructions can transfer control to previously selected tokens (loops / calls). A prefix is elongated on the program's explicit request. A prefix may be a complete program for some task (programs are prefix-free w.r.t. a task), but may request more tokens on another task (incrementally growing self-delimiting programs).
Access to previous solutions. Let p^n denote a found prefix solving the first n tasks. p^1, ..., p^n are stored or frozen in non-modifiable memory accessible to all tasks (accessible to p^{n+1}), but can be copied into modifiable task-specific memory.
Initial Bias. Task-dependent, user-provided probability distribution on program prefixes.

58 Self-computed suffix probabilities. Any executed prefix can assign a probability distribution to its continuations. The distribution is encoded and manipulated in task-specific internal memory.
Two searches. Run in parallel until p^{n+1} is discovered. The first is exhaustive: it tests all possible prefixes in parallel on all tasks up to n+1. The second is focused: it searches for prefixes starting with p^n and tests them only on task n+1 (such prefixes already solve the tasks up to n). When an optimal universal solver is found as some p^{n_0}, at most half of the future run time is wasted by the first search.
Bias-optimal backtracking. Depth-first search in program space, with backtracking triggered by running over time (prefix probability multiplied by total search time so far). Space is reused.
Example / experiments. Interpreter for a FORTH-like language with recursive functions, loops, arithmetic, bias-shifting instructions, and domain-specific instructions. First taught about recursion: samples of the CF language {1^k 2^k}, k ≤ 30. This took 1/3 of a day. (OOPS found a universal solver for all k.) Then, by rewriting its search procedure, it learned k-disk Towers of Hanoi within a couple of days.

59 OOPS-Based Reinforcement Learning. Two OOPS modules:
1. The predictor is first trained to find a better world model.
2. The second module (the control program) then uses the model to search for a future action sequence with better cumulative reward.
3. After the current cycle's time for the control program is finished, we execute the current action of the best control program found in step 2.
OOPS is 8-bias-optimal.

60 Goedel Machine (reference: GoedelMachines)
While executing some initial problem-solving strategy, a Goedel Machine simultaneously runs a proof searcher which systematically and repeatedly tests proof techniques. An unguarded part of the GM, switchprog, can rewrite the whole GM. It is executed only when the GM has found a proof that this will result in bigger expected reward.

61 A program that prints itself.
There is no problem with part of a program representing the whole program, to any degree of accuracy.
main(){char q=34,n=10,*a="main(){char q=34,n=10,*a=%c%s%c;printf(a,q,a,q,n);}%c";printf(a,q,a,q,n);}
Globally Optimal Self-Changes. Given any formalizable utility function u and assuming consistency of the underlying formal system A, any self-change of p obtained through the execution of some switchprog identified through the proof of a target theorem [that running switchprog increases expected reward] is globally optimal: the utility of executing the present switchprog is higher than the utility of waiting for the proof searcher to produce an alternative switchprog later.

62 Proof Techniques
1. get-axiom(n)
   a. Hardware axioms
   b. Reward axioms
   c. Environment axioms
   d. Uncertainty and string manipulation axioms
   e. Initial state axioms
   f. Utility axioms
2. apply-rule(k, m, n)
3. delete-theorem(m)
4. set-switchprog(m, n)
5. state2theorem(m, n)
The GM hardware can itself be probabilistic; this has to be represented by a probabilistic logic and in expectations about which theorems hold.

63 Possible Types of Goedel Machine Self-improvements
1. Just change the ratio of time-sharing between the proof-searching subroutine and the subpolicy e (those parts of p responsible for environment interaction).
2. Modify e only. For example, to conduct some experiments and use the resulting knowledge. (Even if it turns out that it would have been better to stick with the previous routine, the expectation of reward can favor experimentation.)
3. Modify the axioms to speed up theorem proving.
4. Modify the utility function and target theorem, so that the new values are better according to the current target theorem.
5. Modify the probability distribution on proof techniques, etc.
6. Do promptly a very limited rewrite to meet some deadline.
7. In certain uninteresting environments, trash almost all of the GM and leave a looping call to a pleasure-center-activating function.
8. Take actions in the external environment to augment the machine's hardware.


More information

Artificial Intelligence. 3 Problem Complexity. Prof. Dr. Jana Koehler Fall 2016 HSLU - JK

Artificial Intelligence. 3 Problem Complexity. Prof. Dr. Jana Koehler Fall 2016 HSLU - JK Artificial Intelligence 3 Problem Complexity Prof. Dr. Jana Koehler Fall 2016 Agenda Computability and Turing Machines Tractable and Intractable Problems P vs. NP Decision Problems Optimization problems

More information

Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Be able to define the following terms and answer basic questions about them:

Be able to define the following terms and answer basic questions about them: CS440/ECE448 Fall 2016 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables o Axioms of probability o Joint, marginal, conditional probability

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

Asymptotics of Discrete MDL for Online Prediction

Asymptotics of Discrete MDL for Online Prediction Technical Report IDSIA-13-05 Asymptotics of Discrete MDL for Online Prediction arxiv:cs.it/0506022 v1 8 Jun 2005 Jan Poland and Marcus Hutter IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland {jan,marcus}@idsia.ch

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

2 Plain Kolmogorov Complexity

2 Plain Kolmogorov Complexity 2 Plain Kolmogorov Complexity In this section, we introduce plain Kolmogorov Complexity, prove the invariance theorem - that is, the complexity of a string does not depend crucially on the particular model

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Using Localization and Factorization to Reduce the Complexity of Reinforcement Learning

Using Localization and Factorization to Reduce the Complexity of Reinforcement Learning Using Localization and Factorization to Reduce the Complexity of Reinforcement Learning Peter Sunehag 1,2 and Marcus Hutter 1 Sunehag@google.com, Marcus.Hutter@anu.edu.au 1 Research School of Computer

More information

Algorithmic Information Theory

Algorithmic Information Theory Algorithmic Information Theory [ a brief non-technical guide to the field ] Marcus Hutter RSISE @ ANU and SML @ NICTA Canberra, ACT, 0200, Australia marcus@hutter1.net www.hutter1.net March 2007 Abstract

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010 Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book

More information

Computability Theory

Computability Theory Computability Theory Cristian S. Calude May 2012 Computability Theory 1 / 1 Bibliography M. Sipser. Introduction to the Theory of Computation, PWS 1997. (textbook) Computability Theory 2 / 1 Supplementary

More information

On the Computability of AIXI

On the Computability of AIXI On the Computability of AIXI Jan Leike Australian National University jan.leike@anu.edu.au Marcus Hutter Australian National University marcus.hutter@anu.edu.au Abstract How could we solve the machine

More information

Complexity 6: AIT. Outline. Dusko Pavlovic. Kolmogorov. Solomonoff. Chaitin: The number of wisdom RHUL Spring Complexity 6: AIT.

Complexity 6: AIT. Outline. Dusko Pavlovic. Kolmogorov. Solomonoff. Chaitin: The number of wisdom RHUL Spring Complexity 6: AIT. Outline Complexity Theory Part 6: did we achieve? Algorithmic information and logical depth : Algorithmic information : Algorithmic probability : The number of wisdom RHUL Spring 2012 : Logical depth Outline

More information

Computational Tasks and Models

Computational Tasks and Models 1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to

More information

U Logo Use Guidelines

U Logo Use Guidelines Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity

More information

Self-Modification and Mortality in Artificial Agents

Self-Modification and Mortality in Artificial Agents Self-Modification and Mortality in Artificial Agents Laurent Orseau 1 and Mark Ring 2 1 UMR AgroParisTech 518 / INRA 16 rue Claude Bernard, 75005 Paris, France laurent.orseau@agroparistech.fr http://www.agroparistech.fr/mia/orseau

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Finite-Sample Analysis in Reinforcement Learning

Finite-Sample Analysis in Reinforcement Learning Finite-Sample Analysis in Reinforcement Learning Mohammad Ghavamzadeh INRIA Lille Nord Europe, Team SequeL Outline 1 Introduction to RL and DP 2 Approximate Dynamic Programming (AVI & API) 3 How does Statistical

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity

Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity Eckehard Olbrich MPI MiS Leipzig Potsdam WS 2007/08 Olbrich (Leipzig) 26.10.2007 1 / 18 Overview 1 Summary

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Projects Project descriptions due today! Last class Sequence to sequence models Attention Pointer networks Today Weak

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Computability and Complexity

Computability and Complexity Computability and Complexity Decidability, Undecidability and Reducibility; Codes, Algorithms and Languages CAS 705 Ryszard Janicki Department of Computing and Software McMaster University Hamilton, Ontario,

More information