Confident Bayesian Sequence Prediction. Tor Lattimore

Size: px

Start display at page:

Download "Confident Bayesian Sequence Prediction. Tor Lattimore"

Cameron Henderson
5 years ago
Views:

1 Confident Bayesian Sequence Prediction Tor Lattimore

2 Sequence Prediction Can you guess the next number? 1, 2, 3, 4, 5,... 3, 1, 4, 1, 5,... 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,...

3 Sequence Prediction x is a sequence over countable alphabet X µ(x) is the µ-probability of observing x µ(a x) is the µ-probability of observing a X given x M is a countable set of measures (sequence generators) Hellinger 2 Distance h x (ρ, µ) := ( ρ(a x) ) 2 a X µ(a x) Total Variation Distance δ x (ρ, µ) := 1 2 a X ρ(a x) µ(a x) hx (ρ, µ) δ x (ρ, µ) X = {heads, tails} X = {rain, shine} X = {plague, smallpox, flu} X = {1, 2, 3, } Goal: Construct predictor ρ such that for all µ M when x is sampled from µ ρ( x <t ) fast µ( x <t )

4 Example X = {heads, tails} M = {µ θ } is a countable set of Bernoulli measures (coins) ρ(heads x <t ) = number of heads in x<t number of observations=t 1 Theorem (Law of Large Numbers) If µ M, then lim t h x<t (ρ, µ) = 0 with µ-probability 1

5 Example (continued) Theorem (Corollary of Hoeffding Bound) If µ M, then with µ-probability at least 1 δ ( t) ρ(heads x <t ) µ(heads x <t ) 1 2t(t + 1) log 2t δ Frequentist Confidence Optimal Confidence Error Time δ = 1 /10 and θ = 1 /2

6 Bayesian Predictors No assumption on M except that it is countable w : M (0, 1] is a prior on M ξ(x) := ν M w(ν)ν(x) is the Bayes measure Theorem (Solomonoff & Hutter) If µ M, then lim t h x<t (µ, ξ) = 0 with µ-probability 1 E µ t=1 h x <t (µ, ξ) ln 1 w(µ) Done? Not quite, h x<t (µ, ξ) is unknown. Can you construct a confidence bound on the error like in the Bernoulli case?

7 Posterior Convergence Theorem (Bayes Law) P (H D) = P (H) P (D H) P (D) Posterior belief in hypothesis ν M having observed x is w(ν x) := w(ν) ν(x) ξ(x) Conjecture The posterior w( x) concentrates about the truth as data is observed

8 Posterior Convergence Bayes predictive distribution 1/ Posterior at time 1 21 Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21

9 Posterior Convergence Bayes predictive distribution 1/ Posterior at time 2 21 Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21

10 Posterior Convergence Bayes predictive distribution 1/ Posterior at time 3 21 Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21

11 Posterior Convergence Bayes predictive distribution 1/ Posterior at time Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21

12 Posterior Convergence Bayes predictive distribution 1/ Posterior at time Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21

13 Bayesian Confidence Theorem (Ville) With µ-probability at least 1 δ ( n), w(µ x <n ) δw(µ) With high probability the posterior belief in true hypothesis µ never falls below δw(µ) Define set of plausible environments M(x <n ) := {ν M : η n, w(ν x <η ) δw(ν)} and confidence bound on error Theorem ĥ(x <n ) := max {h x<n (ν, ξ) : ν M(x <n )} With µ-probablity at least 1 δ it holds that ĥ(x <n) h x<n (µ, ξ) for all n

14 Posterior Convergence Bayes predictive distribution ν M(x <t ) maximising h x<t (ν, ξ) 1/ Posterior at time 1 21 Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21 and δ = 1 /10

15 Posterior Convergence Bayes predictive distribution ν M(x <t ) maximising h x<t (ν, ξ) 1/ Posterior at time 2 21 Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21 and δ = 1 /10

16 Posterior Convergence Bayes predictive distribution ν M(x <t ) maximising h x<t (ν, ξ) 1/ Posterior at time 3 21 Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21 and δ = 1 /10

17 Posterior Convergence Bayes predictive distribution ν M(x <t ) maximising h x<t (ν, ξ) 1/ Posterior at time Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21 and δ = 1 /10

18 Posterior Convergence Bayes predictive distribution ν M(x <t ) maximising h x<t (ν, ξ) 1/ Posterior at time Bernoulli measures with true θ = 1 /2 and w(ν) = 1 /21 and δ = 1 /10

19 Bayesian Confidence M(x <n ) := {ν M : η n, w(ν x <η ) δw(ν)} ĥ(x <n ) := max {h x<n (ν, ξ) : ν M(x <n )} Frequentist Confidence Bayes Confidence Optimal Confidence Error δ = 1 /10 and θ = 1 /2. Time

20 Bayesian Confidence M(x <n ) := {ν M : η n, w(ν x <η ) δw(ν)} ĥ(x <n ) := max {h x<n (ν, ξ) : ν M(x <n )} Theorem If w is uniform, then with µ-probability at least 1 δ n=1 ( ĥ(x <n ) M ln M + ln M ) δ Therefore ĥ(x <n) converges fast to zero

21 Knows What It Knows Framework Environment agent fails no h x<t (ρ, µ) ɛ? yes present x t to agent Start choose µ M output ρ( x <t) output Agent yes Am I confident? no

22 Knows What It Knows Algorithm 1 Inputs: ε, δ and M := { ν 1, ν 2,, ν M }. 2 t 1 and x <t ɛ and w : M [0, 1] is uniform 3 loop 4 if ĥt(x <t ) ε then 5 output ξ( x <t ) 6 else 7 output 8 observe x t and t t + 1 Theorem The following hold: 1 The agent fails with probability at most δ 2 The number of times action is taken is at most ( M O log M ) ɛ δ with probability at last 1 δ

23 Summary Constructed frequentist-style confidence intervals for discrete non-i.i.d. Bayes Works well in theory and in practise Leads to state-of-the-art bounds for KWIK learning Generic and applicable elsewhere (Bandits/RL) Also have bounds for KL divergence Countable case also covered

Online Prediction: Bayes versus Experts

Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,