CS 683 Learning, Games, and Electronic Markets                          Spring 2007

Notes from Week 9: Multi-Armed Bandit Problems II

Instructor: Robert Kleinberg                                            26-30 Mar 2007

1  Information-theoretic lower bounds for multi-armed bandits

1.1  KL divergence

The central notion in this lower bound proof, as in many information-theoretic lower bound proofs, is KL divergence. This is a measure of the statistical distinguishability of two probability distributions on the same set.

Definition 1. Let Ω be a finite set with two probability measures p, q. Their Kullback-Leibler divergence, or KL-divergence, is the sum

    KL(p; q) = Σ_{x ∈ Ω} p(x) ln( p(x) / q(x) ),

with the convention that p(x) ln(p(x)/q(x)) is interpreted to be 0 when p(x) = 0 and +∞ when p(x) > 0 and q(x) = 0. If Y is a random variable defined on Ω and taking values in some set Γ, the conditional Kullback-Leibler divergence of p and q given Y is the sum

    KL(p; q | Y) = Σ_{x ∈ Ω} p(x) ln( p(x | Y = Y(x)) / q(x | Y = Y(x)) ),

where terms containing ln(0) or ln(∞) are handled according to the same convention as above.

Remark 1. Some authors use the notation D(p‖q) instead of KL(p; q). In fact, the D(p‖q) notation was adopted when this material was presented in class. In these notes we will use the KL(p; q) notation, which is more convenient for expressing conditioning.

One should visualize KL-divergence as a measure of certainty that observed data is coming from some true distribution p as opposed to a counterfactual distribution q. If a sample point x ∈ Ω represents the observed data, then the log-likelihood-ratio ln(p(x)/q(x)) is a measure of how much more likely we are to observe the data under distribution p than under distribution q. The KL-divergence is the expectation of this log-likelihood-ratio when the random sample x actually does come from distribution p.
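
As a quick illustration of Definition 1 (added here for illustration, not part of the original notes), here is a short Python sketch that computes KL(p; q) for distributions on a small finite set, following the conventions for zero probabilities stated above.

```python
import math

def kl(p, q):
    """KL(p; q) for two distributions given as dicts over the same finite set."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                  # convention: 0 * ln(0/q(x)) = 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf           # convention: p(x) > 0, q(x) = 0 gives +infinity
        total += px * math.log(px / qx)
    return total

# A slightly biased coin against a fair coin.
p = {"heads": 0.6, "tails": 0.4}
q = {"heads": 0.5, "tails": 0.5}
print(kl(p, q))                       # roughly 0.02 nats
```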

The following lemma summarizes some standard facts about KL-divergence; for proofs, see Cover and Thomas's book Elements of Information Theory.

Lemma 1. Let p, q be two probability measures on a measure space (Ω, F) and let Y be a random variable defined on Ω and taking values in some finite set Γ. Define a pair of probability measures p_Y, q_Y on Γ by specifying that p_Y(y) = p(Y = y), q_Y(y) = q(Y = y) for each y ∈ Γ. Then

    KL(p; q) = KL(p; q | Y) + KL(p_Y; q_Y),

and KL(p; q | Y) is non-negative.

The equation given in the lemma is sometimes called the chain rule for KL divergence. Its interpretation is as follows: the amount of certainty we gain by observing a pair of random variables (X, Y) is equal to the amount of certainty we gain by observing Y alone, plus the additional amount of certainty we gain by observing X, conditional on Y.

The KL-divergence of two distributions can be thought of as a measure of their statistical distinguishability. We will need three lemmas concerning KL-divergence. The first lemma asserts that a sequence of n experiments cannot be very good at distinguishing two possible distributions if none of the individual experiments is good at distinguishing them. The second one shows that if the KL-divergence of p and q is small, then an event which is reasonably likely under distribution p cannot be too unlikely under distribution q. The third lemma estimates the KL-divergence of distributions which are very close to the uniform distribution on {0, 1}.

Lemma 2. Suppose Ω_0, Ω_1, ..., Ω_n is a sequence of finite probability spaces, and suppose we are given two probability measures p_i, q_i on Ω_i (0 ≤ i ≤ n) and random variables Y_i : Ω_i → Ω_{i−1} such that p_{i−1} = (p_i)_{Y_i}, q_{i−1} = (q_i)_{Y_i} for i = 1, 2, ..., n. If p_0 = q_0 and KL(p_i; q_i | Y_i) < δ for all i, then KL(p_n; q_n) < δn.

Proof. The proof is by induction on n, the base case n = 1 being trivial. For the induction step, Lemma 1 implies

    KL(p_n; q_n) = KL(p_n; q_n | Y_n) + KL(p_{n−1}; q_{n−1}) < δ + KL(p_{n−1}; q_{n−1}),

and the right side is less than δn by the induction hypothesis.
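
The chain rule of Lemma 1 is easy to check numerically on a toy example. The following sketch (added for illustration; the sample space and the random variable Y are arbitrary choices) verifies that KL(p; q) equals KL(p; q | Y) + KL(p_Y; q_Y).

```python
import math

omega = [0, 1, 2, 3]
p = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3}
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
Y = lambda x: x // 2                      # a random variable on omega with values in {0, 1}

def kl(p, q, support):
    return sum(p[x] * math.log(p[x] / q[x]) for x in support if p[x] > 0)

# Marginal distributions of Y under p and q.
pY = {y: sum(p[x] for x in omega if Y(x) == y) for y in (0, 1)}
qY = {y: sum(q[x] for x in omega if Y(x) == y) for y in (0, 1)}

# Conditional divergence KL(p; q | Y), comparing p(x | Y = Y(x)) with q(x | Y = Y(x)).
kl_given_Y = sum(p[x] * math.log((p[x] / pY[Y(x)]) / (q[x] / qY[Y(x)]))
                 for x in omega if p[x] > 0)

print(kl(p, q, omega))                    # KL(p; q)
print(kl_given_Y + kl(pY, qY, (0, 1)))    # chain rule: the same number
```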

Lemma 3. If p, q are two distributions on Ω, then

    KL(p; q) ≥ (1/2) ‖p − q‖_1².    (1)

Proof. Let A be any event with p(A) = a, q(A) = b, and let Y be the indicator random variable of A. Lemma 1 ensures that KL(p; q) ≥ KL(p_Y; q_Y), while

    KL(p_Y; q_Y) = a ln(a/b) + (1 − a) ln( (1 − a)/(1 − b) )
                 = ∫_a^b ( (1 − a)/(1 − x) − a/x ) dx
                 = ∫_a^b (x − a)/(x(1 − x)) dx
                 ≥ ∫_a^b 4(x − a) dx    (2)

and

    ‖p_Y − q_Y‖_1² = [2(p(A) − q(A))]² = 4(a − b)² = ∫_a^b 8(x − a) dx.    (3)

If b ≥ a, comparing (2) with (3) confirms that KL(p_Y; q_Y) ≥ (1/2)‖p_Y − q_Y‖_1². If b < a we rewrite the right sides of (2) and (3) as ∫_b^a 4(a − x) dx and ∫_b^a 8(a − x) dx to make the intervals properly oriented and the integrands non-negative, and again the same inequality follows. Choosing A = {x : p(x) ≤ q(x)}, so that ‖p_Y − q_Y‖_1 = ‖p − q‖_1, yields (1).

Lemma 4. If 0 < ε < 1/2 and p, q, r are the distributions on {0, 1} defined by

    p(1) = (1 + ε)/2,    q(1) = 1/2,    r(1) = (1 − ε)/2,
    p(0) = (1 − ε)/2,    q(0) = 1/2,    r(0) = (1 + ε)/2,

then KL(p; q) < 2ε² and KL(p; r) < 4ε².

Proof.

    KL(p; q) = ((1 + ε)/2) ln(1 + ε) + ((1 − ε)/2) ln(1 − ε)
             = (1/2) ln(1 − ε²) + (ε/2) ln( (1 + ε)/(1 − ε) )
             < (ε/2) ( 2ε/(1 − ε) ) = ε²/(1 − ε) < 2ε².

    KL(p; r) = ((1 + ε)/2) ln( (1 + ε)/(1 − ε) ) + ((1 − ε)/2) ln( (1 − ε)/(1 + ε) )
             = (1/2) ln(1) + ε ln( (1 + ε)/(1 − ε) )
             = ε ln( (1 + ε)/(1 − ε) ) < 2ε²/(1 − ε) < 4ε².
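
The bounds in Lemmas 3 and 4 can be sanity-checked numerically. The sketch below (added for illustration; the particular values of ε are arbitrary) compares KL(p; q) and KL(p; r) against 2ε² and 4ε², and also checks inequality (1) for these two-point distributions.

```python
import math

def kl2(p1, q1):
    """KL divergence between distributions on {0, 1} given by their probabilities of 1
    (assumes both probabilities lie strictly between 0 and 1)."""
    return p1 * math.log(p1 / q1) + (1 - p1) * math.log((1 - p1) / (1 - q1))

for eps in (0.05, 0.1, 0.25, 0.4):
    p1 = (1 + eps) / 2                         # p(1) = (1 + eps)/2
    q1 = 0.5                                   # the fair coin
    r1 = (1 - eps) / 2                         # the oppositely biased coin
    assert kl2(p1, q1) < 2 * eps ** 2          # Lemma 4, first bound
    assert kl2(p1, r1) < 4 * eps ** 2          # Lemma 4, second bound
    l1 = abs(p1 - q1) + abs((1 - p1) - (1 - q1))
    assert kl2(p1, q1) >= 0.5 * l1 ** 2        # Lemma 3 for these two-point distributions
print("all bounds hold for the tested values of eps")
```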

1.2  Distinguishing coins

Suppose that I give you two coins: one fair, the other with bias 1/2 − ε. Your job is to repeatedly choose one coin to flip, and to stop when you think you know which one is biased. Your answer should be correct with probability at least some fixed constant greater than 1/2. How many flips does it take to identify the biased coin? The answer is on the order of 1/ε². While it is possible to prove this by elementary means, here we will give a proof using KL-divergence, which has the benefit of generalizing to the case where there are more than two coins.

Theorem 5. Let p, q be two distributions on {0, 1}^m such that the m bits are independent, with expected value 1/2 − ε under p and 1/2 under q. If 16m ≤ 1/ε² and A is any event, then p(A) − q(A) ≤ 1/2.

Proof. We have

    p(A) − q(A) ≤ (1/2) ‖p − q‖_1 ≤ √( (1/2) KL(p; q) ).

Also

    KL(p; q) = Σ_{i=1}^m KL( p(x_i); q(x_i) | x_1, ..., x_{i−1} ).

By Lemma 4 (applied with bias parameter 2ε) each term is at most 8ε², so KL(p; q) ≤ 8ε²m and hence p(A) − q(A) ≤ √(4ε²m) = 2ε√m ≤ 1/2.
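
The following Monte Carlo sketch (added for illustration; the "guess the coin with the smaller empirical mean" rule and the sample sizes are choices made only for this experiment) shows the phenomenon behind Theorem 5: with roughly 1/(16ε²) flips per coin the biased coin is frequently misidentified, while with many more flips than 1/ε² it is almost always identified correctly.

```python
import random

def guess_biased(eps, flips_per_coin, rng):
    """Flip a fair coin and a (1/2 - eps)-biased coin flips_per_coin times each and
    guess that the coin with the smaller empirical mean is the biased one."""
    means = []
    for prob_of_one in (0.5, 0.5 - eps):           # index 1 is the biased coin
        ones = sum(rng.random() < prob_of_one for _ in range(flips_per_coin))
        means.append(ones / flips_per_coin)
    return 1 if means[1] < means[0] else 0         # ties are resolved in favor of coin 0

eps, trials = 0.05, 2000
rng = random.Random(0)
for m in (int(1 / (16 * eps ** 2)), int(10 / eps ** 2)):
    correct = sum(guess_biased(eps, m, rng) == 1 for _ in range(trials))
    print(f"{m} flips per coin: correct {correct / trials:.2f} of the time")
```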

Here's the generalization to more than two coins. We have n coins and an algorithm which chooses, at each time t (1 ≤ t ≤ T), a coin x_t to flip. It also outputs a guess y_t; the guess is considered correct if y_t is a biased coin. The choice of x_t, y_t is only allowed to depend on the outcomes of the first t − 1 coin flips.

Consider the following distributions on coin-flip outcomes. Distribution p_0 is the distribution in which all coins are fair. Distribution p_j is the distribution in which coin j has bias 1/2 − ε while all others are fair coins. In all these distributions p_0, p_1, ..., p_n, the different coin flips are mutually independent events. When 0 ≤ j ≤ n, we will denote the probability of an event under distribution p_j by Pr_j. Similarly, the expectation of a random variable under distribution p_j will be denoted by E_j.

Theorem 6. Let ALG be any coin-flipping algorithm. If 100t ≤ n/ε² then there exist at least n/3 distinct values of j > 0 such that Pr_j(y_t = j) ≤ 1/2.

Proof. The intuition is as follows. Let Q_j denote the random variable which counts the number of times ALG flips coin j before time t. If E_j(Q_j) is much smaller than 1/ε², then at time t the algorithm is unlikely to have accumulated enough evidence that j is the biased coin. On the other hand, since there are n coins and t ≤ n/(100ε²), for most values of j it is unlikely that the algorithm flips coin j more than 1/ε² times before time t.

To make this precise, first note that Σ_{j=1}^n E_0(Q_j) = t by linearity of expectation. Hence the set J_1 = {j : E_0(Q_j) ≤ 3t/n} has at least 2n/3 elements. Moreover, the set J_2 = {j : Pr_0(y_t = j) ≤ 3/n} has at least 2n/3 elements. Let J = J_1 ∩ J_2; this set has at least n/3 elements. If j ∈ J and E is the event {y_t = j} then

    Pr_j(E) ≤ Pr_0(E) + |Pr_j(E) − Pr_0(E)|
            ≤ Pr_0(E) + (1/2) ‖p_0 − p_j‖_1
            ≤ 3/n + √( (1/2) KL(p_0; p_j) ).

Moreover, by Lemmas 2 and 4,

    KL(p_0; p_j) ≤ 8ε² E_0(Q_j) ≤ 24ε² t/n < 1/4.    (4)

Combining the last two displays, Pr_j(E) ≤ 3/n + √(1/8) ≤ 1/2 (provided n is not too small), which proves the theorem.

1.3  The multi-armed bandit lower bound

Let MAB be any algorithm for the n-armed bandit problem. We will describe a procedure for generating a random input such that the expected regret accumulated by MAB when running against this random input is Ω(√(nT)).

Define distributions p_0, p_1, ..., p_n on ({0, 1}^n)^T as follows. First, p_0 is the uniform distribution on this set. Second, for 1 ≤ j ≤ n, p_j is the distribution in which the random variables c_t(i) are mutually independent, and

    Pr(c_t(i) = 1) = 1/2         if i ≠ j
                     1/2 − ε     if i = j.

Here ε = (1/10)√(n/T), so that 100T ≤ n/ε². The random input is generated by sampling a number j* uniformly at random from [n] and then sampling the sequence of cost functions according to p_{j*}.
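
The random input of Section 1.3 is straightforward to generate; this sketch (added for illustration) samples j* and then a T × n matrix of 0/1 costs in which arm j* is cheaper than the others by ε = (1/10)√(n/T).

```python
import math
import random

def hard_instance(n, T, rng):
    """Sample the random input of Section 1.3: a secret arm j* and a T-by-n matrix of
    0/1 costs in which arm j* is cheaper than the rest by eps = (1/10) * sqrt(n/T)."""
    eps = math.sqrt(n / T) / 10
    j_star = rng.randrange(n)
    costs = [[1 if rng.random() < (0.5 - eps if i == j_star else 0.5) else 0
              for i in range(n)]
             for _ in range(T)]
    return j_star, eps, costs

rng = random.Random(1)
j_star, eps, costs = hard_instance(n=10, T=10000, rng=rng)
print(j_star, eps, sum(row[j_star] for row in costs) / len(costs))   # empirical mean cost of arm j*
```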

Remark 2. There is nothing surprising in the fact that a lower bound for oblivious adversaries against randomized algorithms comes from looking at a distribution over inputs. Yao's Lemma says that it must be possible to prove the lower bound this way. However, there are two surprising things about the way this lower bound is established. First, conditional on the random number j*, the samples are independent and identically distributed. There's nothing in Yao's Lemma which says that the worst-case distribution of inputs has to be such a simple distribution. Second, the lower bound for oblivious adversaries nearly matches the upper bound for adaptive adversaries. It appears that almost all of the adversary's power comes from the ability to tailor the value of ε to the time horizon T. Note that there is still a logarithmic gap between the lower and upper bounds. Closing this gap is an interesting open question. Finally, the most important thing to appreciate about this lower bound (besides the proof technique, which serves as a very good demonstration of the power of KL divergence) is that it pins down precisely which multi-armed bandit problems are the toughest: those in which the n strategies have nearly identical payoff distributions but one of them is just slightly better than the rest.

Consider the following coin-flipping algorithm based on MAB. At time t, when MAB chooses strategy x_t, the coin-flipping algorithm chooses to flip coin x_t and guesses y_t = x_t as well. The previous theorem about coin-flipping algorithms proves that there exists a set J_t with at least n/3 distinct elements, such that if we run this coin-flipping algorithm then for all t (1 ≤ t ≤ T) and j ∈ J_t, Pr_j(x_t = j) ≤ 1/2, which implies that

    E[ c_t(x_t) | j* ∈ J_t ] ≥ (1/2)(1/2) + (1/2)(1/2 − ε) = 1/2 − ε/2.    (5)

Trivially,

    E[ c_t(x_t) | j* ∉ J_t ] ≥ 1/2 − ε.    (6)

Recalling that |J_t| ≥ n/3, we have Pr(j* ∈ J_t) ≥ 1/3, so

    E[ c_t(x_t) ] ≥ (1/3)(1/2 − ε/2) + (2/3)(1/2 − ε) = 1/2 − 5ε/6.    (7)

Hence

    E[ Σ_{t=1}^T c_t(x_t) ] ≥ T/2 − 5εT/6,

while

    E[ Σ_{t=1}^T c_t(j*) ] = T/2 − εT.

It follows that the regret of MAB is at least εT/6 = (1/60)√(nT).
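
One way to see the √(nT) scaling is to run some bandit algorithm against the instance above and record its pseudo-regret. The sketch below (added for illustration; the ε-greedy rule is just a convenient test algorithm, not one analyzed in these notes) sets up such an experiment and prints the measured regret next to the (1/60)√(nT) lower bound.

```python
import math
import random

def eps_greedy_regret(n, T, seed=0):
    """Pseudo-regret of a simple epsilon-greedy rule on the hard instance of Section 1.3
    (epsilon-greedy is only a convenient test algorithm, not one from these notes)."""
    rng = random.Random(seed)
    gap = math.sqrt(n / T) / 10                  # the eps of the construction
    j_star = rng.randrange(n)
    totals, counts, regret = [0.0] * n, [0] * n, 0.0
    for _ in range(T):
        if 0 in counts or rng.random() < 0.1:    # explore
            x = rng.randrange(n)
        else:                                    # exploit the empirically cheapest arm
            x = min(range(n), key=lambda i: totals[i] / counts[i])
        p = 0.5 - gap if x == j_star else 0.5
        totals[x] += 1.0 if rng.random() < p else 0.0
        counts[x] += 1
        if x != j_star:
            regret += gap                        # expected excess cost of this round
    return regret

n, T = 10, 20000
print(eps_greedy_regret(n, T), math.sqrt(n * T) / 60)   # measured regret vs. (1/60) sqrt(nT)
```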

2  Markov decision processes

Up to this point, our treatment of multi-armed bandit problems has focused on worst-case analysis of algorithms. Historically, the first approach to multi-armed bandit problems was grounded in average-case analysis of algorithms; this is still the most influential approach to multi-armed bandit algorithms. It assumes that the decision-maker has a prior distribution which is a probability measure on the set of possible input sequences. The task is then to design a Bayesian optimal bandit algorithm, i.e. one which optimizes the expected cost of the decision sequence, assuming that the actual input sequence is a random sample from the prior distribution.

Under a convenient assumption (namely, that costs are geometrically time-discounted) this problem belongs to a class of planning problems called Markov decision problems, or simply MDP's. MDP's are much more general than multi-armed bandit problems, and they constitute an extremely important topic in artificial intelligence. See sutton/book/ebook/the-book.html for a textbook with excellent coverage of the subject. In these notes we will explore the most basic elements of the theory of MDP's, with the aim of laying the technical foundations for the Gittins index theorem, a theorem which describes the Bayesian optimal bandit algorithm when the prior distribution over the n strategies is a product distribution (i.e. the prior belief is that different strategies are uncorrelated) and the costs are geometrically time-discounted.

2.1  Definitions

Definition 2. A Markov decision process (MDP) is specified by the following data:

1. a finite set S of states;
2. a finite set A of actions;
3. transition probabilities P^a_{s,s'} for all s, s' ∈ S, a ∈ A, specifying the probability of a state transition from s to s' given that the decision-maker selects action a when the system is in state s;
4. costs c(s, a) ∈ R_+ specifying the cost of choosing action a ∈ A when the system is in state s.

Some authors define MDP's using payoffs or rewards instead of costs, and this change of terminology implies changing the objective from cost minimization to payoff or reward maximization. Also, some authors define the costs/payoffs/rewards to be random variables. Here we have opted to make them deterministic. From the standpoint of solving the expected-cost-minimization problem, it does not matter whether c(s, a) is defined to be a random variable or to be the expectation of that random variable. For simplicity, we have opted for the latter interpretation.

Definition 3. A policy for an MDP is a rule for assigning a probability π(s, a) ∈ [0, 1] to each state-action pair (s, a) ∈ S × A such that for all s ∈ S, Σ_{a∈A} π(s, a) = 1. A pure policy is a policy such that π(s, a) ∈ {0, 1} for all s ∈ S, a ∈ A. If π is a pure policy, then for every s ∈ S there is a unique a ∈ A such that π(s, a) = 1; we will sometimes denote this unique value of a as π(s), by abuse of notation.
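
For experimenting with Definitions 2 and 3 it helps to fix a concrete encoding; the following sketch (added for illustration, one of many possible representations) stores transition probabilities as nested dictionaries P[s][a][s'] and treats a policy as a dictionary of action distributions.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]                                  # the finite set S
    actions: List[Action]                                # the finite set A
    P: Dict[State, Dict[Action, Dict[State, float]]]     # P[s][a][s'] = transition probability
    cost: Dict[State, Dict[Action, float]]               # cost[s][a] = c(s, a) >= 0

def is_policy(mdp: MDP, pi: Dict[State, Dict[Action, float]]) -> bool:
    """Check Definition 3: pi[s] is a probability distribution over actions for every state."""
    return all(abs(sum(pi[s].values()) - 1.0) < 1e-9
               and all(0.0 <= prob <= 1.0 for prob in pi[s].values())
               for s in mdp.states)

def pure_policy(action_of: Dict[State, Action]) -> Dict[State, Dict[Action, float]]:
    """A pure policy puts probability 1 on a single action pi(s) in each state."""
    return {s: {a: 1.0} for s, a in action_of.items()}
```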

Definition 4. A realization of an MDP (S, A, P, c) with policy π is a probability space Ω with the following collection of random variables: a sequence s_0, s_1, ... taking values in S, and a sequence a_0, a_1, ... taking values in A. These random variables are required to obey the specified transition probabilities, i.e.

    Pr(s_{t+1} = s' | s_0, s_1, ..., s_t, a_0, a_1, ..., a_t) = Pr(s_{t+1} = s' | s_t, a_t) = P^{a_t}_{s_t, s'},

and to follow the policy, i.e. Pr(a_t = a | s_0, s_1, ..., s_t, a_0, a_1, ..., a_{t−1}) = π(s_t, a).

Given a policy π for an MDP, there are a few different ways to define the cost of using policy π. One way is to set a finite time horizon T and to define the cost of using π starting from state s ∈ S to be the function

    V^π(s) = E[ Σ_{t=0}^T c(s_t, a_t) | s_0 = s ].

Another way is to define the cost as an infinite sum using geometric time discounting with some discount factor γ < 1:

    V^π(s) = E[ Σ_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s ].

The following general definition incorporates both of these possibilities and many others.

Definition 5. A stopping time τ for an MDP is a random variable defined in a realization of the MDP, taking values in N ∪ {0}, which satisfies the property

    Pr(τ = t | s_0, s_1, ..., a_0, a_1, ...) = Pr(τ = t | s_0, s_1, ..., s_t, a_0, a_1, ..., a_t).

If τ satisfies the stronger property that there exists a function p : S → [0, 1] such that

    Pr(τ = t | τ ≥ t, s_0, s_1, ..., a_0, a_1, ...) = p(s_t),

then we say that τ is a memoryless stopping time and we call p the stopping probability function.

Given an MDP with policy π and stopping time τ, the cost of π is defined to be the function

    V^π(s) = E[ Σ_{t=0}^{τ−1} c(s_t, a_t) | s_0 = s ].

This is also called the value function of π.
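
In the geometrically discounted case, V^π solves the linear system V^π(s) = Σ_a π(s, a)( c(s, a) + γ Σ_{s'} P^a_{s,s'} V^π(s') ), a standard fact obtained by conditioning on the first step (it is not derived at this point of the notes). The sketch below (added for illustration, with a small made-up two-state example) evaluates a policy by solving this system with numpy.

```python
import numpy as np

def evaluate_policy(P, c, pi, gamma):
    """Discounted value function V^pi.  P[a, s, s'] are transition probabilities,
    c[s, a] are costs, pi[s, a] are action probabilities, gamma < 1 is the discount."""
    n_states = c.shape[0]
    P_pi = np.einsum("sa,ast->st", pi, P)      # P_pi[s, s'] = sum_a pi(s, a) P^a_{s, s'}
    c_pi = np.einsum("sa,sa->s", pi, c)        # c_pi[s]     = sum_a pi(s, a) c(s, a)
    # V satisfies V = c_pi + gamma * P_pi V, i.e. (I - gamma * P_pi) V = c_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)

# A made-up two-state, two-action example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # transitions under action 0
              [[0.1, 0.9], [0.1, 0.9]]])       # transitions under action 1
c = np.array([[1.0, 2.0],                      # c[s, a]
              [0.0, 1.0]])
pi = np.array([[1.0, 0.0],                     # the pure policy that always plays action 0
               [1.0, 0.0]])
print(evaluate_policy(P, c, pi, gamma=0.9))
```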

For example, a finite time horizon T is encoded by setting the stopping time τ to be equal to T + 1 at every point of the sample space Ω. Geometric time discounting with discount factor γ < 1 is encoded by setting τ to be a geometrically distributed random variable which is independent of the random variables s_0, s_1, ... and a_0, a_1, ..., i.e. a random variable satisfying

    Pr(τ > t | s_0, s_1, ..., a_0, a_1, ...) = γ^t

for all t ∈ N ∪ {0} and all values of s_0, s_1, ... ∈ S, a_0, a_1, ... ∈ A.

Definition 6. Given a set U ⊆ S, the hitting time of U is a stopping time τ which satisfies τ = min{t : s_t ∈ U} whenever the right side is defined.

If there is a positive probability that the infinite sequence s_0, s_1, s_2, ... never visits U, then the hitting time of U is not a well-defined stopping time. However, we will always be considering sets U such that the hitting time is well-defined. Note that the hitting time of U is a memoryless stopping time whose stopping probability function is p(s) = 1 if s ∈ U and p(s) = 0 otherwise.

Given an MDP with a memoryless stopping time τ, we may assume (virtually without loss of generality) that τ is the hitting time of U for some set of states U ⊆ S. This is because we may augment the MDP by adjoining a single extra state, Done, and defining the transition probabilities P̂ and costs ĉ as follows (where p denotes the stopping probability function of τ):

    P̂^a_{s,s'} = (1 − p(s)) P^a_{s,s'}    if s, s' ∈ S
                 p(s)                      if s ∈ S, s' = Done
                 1                         if s = s' = Done,

    ĉ(s, a) = c(s, a)    if s ∈ S
              0          otherwise.

There is a natural mapping from policies for the augmented MDP to policies for the original MDP and vice versa: given a policy π̂ for the augmented MDP, one obtains a policy π for the original MDP by restricting π̂ to the state-action pairs in S × A; given a policy π for the original MDP, one takes π̂ to be an arbitrary policy whose restriction to S × A is equal to π. Both of these natural mappings preserve the policy's value function. In that sense, solving the original MDP (i.e. identifying a policy of minimum cost) is equivalent to solving the augmented MDP. This is what we mean when we say that a memoryless stopping rule is, without loss of generality, equal to the hitting time of some set U.
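
The augmentation with the extra state Done is purely mechanical; here is a sketch (added for illustration) that takes transition probabilities, costs, and a stopping probability function and produces the augmented P̂ and ĉ of the construction above.

```python
DONE = "Done"

def augment(states, actions, P, c, stop_prob):
    """Replace a memoryless stopping time with stopping probabilities stop_prob[s]
    by the hitting time of {DONE}, as in the construction above.
    P[s][a][s'] are transition probabilities and c[s][a] are costs."""
    P_hat = {s: {a: {} for a in actions} for s in states}
    c_hat = {s: {a: c[s][a] for a in actions} for s in states}
    for s in states:
        for a in actions:
            for s2, prob in P[s][a].items():
                P_hat[s][a][s2] = (1 - stop_prob[s]) * prob   # keep going, then move as before
            P_hat[s][a][DONE] = stop_prob[s]                  # stop: jump to Done
    P_hat[DONE] = {a: {DONE: 1.0} for a in actions}           # Done is absorbing...
    c_hat[DONE] = {a: 0.0 for a in actions}                   # ...and free
    return P_hat, c_hat

# Geometric discounting with factor gamma corresponds to stop_prob[s] = 1 - gamma for all s.
states, actions = ["u", "v"], ["go"]
P = {"u": {"go": {"u": 0.5, "v": 0.5}}, "v": {"go": {"u": 1.0}}}
c = {"u": {"go": 1.0}, "v": {"go": 2.0}}
P_hat, c_hat = augment(states, actions, P, c, stop_prob={"u": 0.5, "v": 0.5})
print(P_hat["u"]["go"])   # {'u': 0.25, 'v': 0.25, 'Done': 0.5}
```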

2.2  Examples

To illustrate the abstract definition of Markov decision processes, we will give two examples in this section. A third illustration is contained in the following section, which explains how MDP's model an important class of bandit problems.

Example 1 (Blackjack with an infinite deck). If one is playing blackjack with an infinite deck of cards (such that the probability of seeing any given type of card is 1/52 regardless of what cards have been seen before) then the game is an MDP whose states are ordered triples (H, B, F), where H is a multiset of cards (the contents of the player's current hand), B > 0 is the size of the player's current bet, and F ∈ {0, 1} specifies whether the player is finished receiving new cards into his or her hand. The set of actions is {hit, stand, double}. In state (H, B, 0), if the player chooses stand then the next state is (H, B, 1) with probability 1. If the player chooses double then the next state is (H, 2B, 0) with probability 1. If the player chooses hit then a random card is added to H, B remains the same, and F changes from 0 to 1 if the sum of the values in H now exceeds 21; otherwise F remains at 0. The stopping time is the hitting time of the set of states such that F = 1. (Consequently it doesn't matter how we define the transition probabilities in such states, though for concreteness we will say that any action taken in such a state leads back to the same state with probability 1.) The cost of taking an action that leads to a state (H, B, 0) is 0; the cost of taking an action that leads to a state (H, B, 1) is B times the probability that the dealer beats a player whose hand is H. (Technically, we are supposed to define the cost as a function of the action and the state immediately preceding that action. Thus we should really define the cost of taking action a in state s to be equal to B times the probability that a leads to a state with F = 1 and the dealer beats the player in this state.) If the deck is not infinite and the player is counting cards, then to model the process as an MDP we must enlarge the state space to include the information that the player recalls about cards that have been dealt in the past.

Example 2 (Playing golf with n golf balls). The game of golf with n golf balls is played by a single golfer using n golf balls on a golf course with a finite set of locations where a ball may come to rest. One of these locations is the hole, and the objective is to get at least one of the n balls to land in the hole while minimizing the total number of strokes (including strokes that involved hitting other balls besides the one which eventually landed in the hole). We can model this as an MDP as follows. Let L be the set of locations, and let h ∈ L denote the hole. The set of states of the MDP is L^n and the stopping time is the hitting time of the set

    U = {(l_1, l_2, ..., l_n) ∈ L^n : ∃i such that l_i = h}.

The set of actions is [n] × C, where C is the set of golf clubs that the golfer is using. The interpretation of action (i, c) is that the golfer uses club c to hit ball number i. When the golfer takes action (i, c), the state updates from (l_1, ..., l_n) to a random new state (l_1, ..., l_{i−1}, l'_i, l_{i+1}, ..., l_n), where the probability of hitting ball i from l to l' using club c is a property of the golfer and the ball which the golfer is hitting (but it does not depend on the time at which the golfer is hitting the ball, nor on the positions of the other balls on the golf course).
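
To make Example 2 concrete, the following sketch (added for illustration; the toy course and clubs are invented) enumerates the state space L^n and the stopping set U; a transition kernel for an action (i, c) would then modify only the i-th coordinate of the state.

```python
from itertools import product

locations = ["tee", "fairway", "green", "hole"]        # a toy course; "hole" plays the role of h
n_balls = 2
clubs = ["driver", "putter"]

states = list(product(locations, repeat=n_balls))      # the state space L^n
U = [s for s in states if "hole" in s]                 # stopping set: some ball is in the hole
actions = list(product(range(n_balls), clubs))         # actions are (ball index, club) pairs

print(len(states), len(U), len(actions))               # 16 states, 7 stopping states, 4 actions
```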

2.3  How is this connected to multi-armed bandits?

Let F denote a family of probability measures on R. For example, F may be the family of all Gaussian distributions, or F may be the family of all distributions supported on the two-element set {0, 1}. Consider a multi-armed bandit problem with strategy set S = [n], in which the decision-maker believes that each strategy i ∈ [n] has costs distributed according to some unknown distribution f_i ∈ F, and that these unknown distributions f_1, f_2, ..., f_n are themselves independent random variables distributed according to n known probability measures μ_1, μ_2, ..., μ_n on F. To put it more precisely, the decision-maker's prior belief distribution can be described as follows. There are n random variables f_1, f_2, ..., f_n taking values in F; they are distributed according to the product distribution μ_1 × ... × μ_n, and the costs c_t(i) are mutually conditionally independent (conditioned on f_1, f_2, ..., f_n) and satisfy

    Pr(c_t(i) ∈ B | f_1, f_2, ..., f_n) = f_i(B)

for every Borel set B ⊆ R. Let us assume, moreover, that there is a fixed discount factor γ < 1 and that the decision-maker wishes to choose a sequence of strategies x_1, x_2, ... so as to minimize the expected time-discounted cost

    E[ Σ_{t≥1} γ^t c_t(x_t) ],

where the expectation is with respect to the decision-maker's prior.

This problem can be modeled as a Markov decision process with an infinite state space. Specifically, a state of the MDP is an n-tuple of beliefs (ν_1, ν_2, ..., ν_n), each of which is a probability measure on F representing the decision-maker's posterior belief about the cost distribution of the corresponding strategy, after performing some number of experiments and observing their outcomes. The set of actions is simply [n]; performing action x at time t in the MDP corresponds to choosing strategy x in step t of the bandit problem. The transition probabilities of the MDP are determined by Bayes' law, which specifies how to update the posterior distribution for strategy x after making one observation of the cost c_t(x). Note that when the decision-maker chooses action x in state (ν_1, ..., ν_n), the resulting state transition only updates the x-th component of the state vector. (Our assumption that the cost distributions of the different strategies are independent ensures that a Bayesian update after observing c_t(x) has no effect on ν_y for y ≠ x.) The cost of choosing action x in state (ν_1, ..., ν_n) is simply the conditional expectation E[c_t(x) | ν_x], i.e. the expected value of a random cost sample drawn according to the belief ν_x.

Note the similarity between this example and the golfing with n golf balls example. Both problems entail studying MDP's in which the states are represented as n-tuples, and each action can only update one component of the n-tuple. In fact, if one generalizes the golfing problem to include golf courses with uncountably many locations and rules in which the cost of a stroke depends on the ball's location at the time it was hit, then it is possible to see the bandit problem as a special case of the golfing problem.
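
For the special case where F consists of distributions on {0, 1}, a convenient (though not required by the notes) choice is to let each prior μ_i be a Beta distribution: the posterior after any number of observations is then again a Beta distribution, so a belief can be stored as a pair of counts. The sketch below (added for illustration, under that conjugacy assumption) represents a state of the belief MDP and the Bayes-law transition triggered by choosing strategy x and observing its cost.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Belief:
    """Posterior over an arm's unknown cost distribution f_i on {0, 1},
    assuming a Beta(alpha, beta) prior on Pr(cost = 1)."""
    alpha: float
    beta: float

    def expected_cost(self) -> float:
        return self.alpha / (self.alpha + self.beta)   # the MDP's cost E[c_t(x) | nu_x]

def bayes_update(state: Tuple[Belief, ...], x: int, observed_cost: int) -> Tuple[Belief, ...]:
    """One transition of the belief MDP: only the x-th component of the state changes."""
    b = state[x]
    new_b = Belief(b.alpha + observed_cost, b.beta + (1 - observed_cost))
    return state[:x] + (new_b,) + state[x + 1:]

# Two arms with uniform (Beta(1, 1)) priors; pull arm 0 and observe cost 1.
state = (Belief(1, 1), Belief(1, 1))
print(bayes_update(state, 0, 1))   # (Belief(alpha=2, beta=1), Belief(alpha=1, beta=1))
```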

2.4  Properties of optimal policies

In this section we prove a sequence of three theorems which characterize optimal policies of MDP's and which establish that every MDP has an optimal policy which is a pure policy. Before doing so, it will be useful to introduce the notation Q^π(s, a), which denotes the expected cost of performing action a at time 0 in state s, and using policy π in every subsequent time step:

    Q^π(s, a) = c(s, a) + E[ Σ_{t=1}^{τ−1} c(s_t, a_t) | s_0 = s, a_0 = a ].

By abuse of notation, for a policy π' we also define Q^π(s, π') to be the weighted average

    Q^π(s, π') = Σ_{a∈A} π'(s, a) Q^π(s, a).

Note that Q^π(s, π) = V^π(s) for any policy π and state s.

Theorem 7 (Policy improvement theorem). If π, π' are policies such that

    Q^π(s, π) ≥ Q^π(s, π')    (8)

for every state s, then

    V^π(s) ≥ V^{π'}(s)    (9)

for every state s. If the inequality (8) is strict for at least one s, then (9) is also strict for at least one s.

Proof. For any t ≥ 0, let π<t> denote the hybrid policy which distributes its actions according to π' at all times before t and according to π at all times ≥ t. (Technically, this does not satisfy our definition of the word policy, since the distribution over actions depends not only on the current state but on the time as well. However, this abuse of terminology should not cause confusion.) For every t ≥ 0 and s ∈ S we have

    V^{π<t+1>}(s) − V^{π<t>}(s) = Σ_{s'∈S} ( Q^π(s', π') − Q^π(s', π) ) Pr(s_t = s' | s_0 = s) ≤ 0.

The theorem now follows by observing that V^{π<0>}(s) = V^π(s) and that lim_{t→∞} V^{π<t>}(s) = V^{π'}(s).
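
Theorem 7 suggests an algorithmic use of Q^π: evaluate π, form Q^π, and switch each state to an action minimizing Q^π(s, ·). The sketch below (added for illustration) does one such improvement step in the geometrically discounted setting, where Q^π(s, a) = c(s, a) + γ Σ_{s'} P^a_{s,s'} V^π(s').

```python
import numpy as np

def evaluate(P, c, pi, gamma):
    """V^pi for the discounted criterion (P[a, s, s'], c[s, a], pi[s, a])."""
    P_pi = np.einsum("sa,ast->st", pi, P)
    c_pi = np.einsum("sa,sa->s", pi, c)
    return np.linalg.solve(np.eye(c.shape[0]) - gamma * P_pi, c_pi)

def improve(P, c, pi, gamma):
    """One policy-improvement step in the spirit of Theorem 7: act greedily w.r.t. Q^pi."""
    V = evaluate(P, c, pi, gamma)
    Q = c + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a] = c(s, a) + gamma sum_s' P^a_{s,s'} V(s')
    greedy = np.argmin(Q, axis=1)                  # an action in argmin_b Q^pi(s, b) for each s
    new_pi = np.zeros_like(pi)
    new_pi[np.arange(len(greedy)), greedy] = 1.0   # the improved pure policy
    return new_pi, V, Q

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.1, 0.9]]])           # P[a, s, s']
c = np.array([[1.0, 2.0], [0.0, 1.0]])             # c[s, a]
pi = np.full((2, 2), 0.5)                          # start from the uniformly random policy
new_pi, V, Q = improve(P, c, pi, gamma=0.9)
print(V, Q, new_pi, sep="\n")
```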

Definition 7. A policy π for an MDP is optimal if V^π(s) ≤ V^{π'}(s) for every state s and every policy π'.

Theorem 8 (Bellman's optimality condition). A policy π is optimal if and only if it satisfies

    a ∈ arg min_{b∈A} Q^π(s, b)    (10)

for every state-action pair (s, a) such that π(s, a) > 0.

Proof. By Theorem 7, if π fails to satisfy (10) for some state-action pair (s, a) such that π(s, a) > 0, then π is not an optimal policy. This is because we may construct a different policy π' such that π'(s) is a probability distribution concentrated on the set arg min_{b∈A} Q^π(s, b), and π'(s') = π(s') for all states s' ≠ s. This new policy π' satisfies Q^π(s, π) > Q^π(s, π'), and Q^π(s', π) = Q^π(s', π') for all s' ≠ s; hence by Theorem 7 there is some state in which V^{π'} is strictly less than V^π, hence π is not optimal.

To prove the converse, assume π is not optimal and let σ be an optimal policy. Also assume (without loss of generality) that the stopping time τ is the hitting time of some set U ⊆ S. Let

    x = max_{s∈S} ( V^π(s) − V^σ(s) ),

and let T = {s ∈ S : V^π(s) = V^σ(s) + x}. Notice that V^π(s) = V^σ(s) = 0 for all s ∈ U, hence T is disjoint from U. Thus there must be at least one state s ∈ T such that the probability of a state transition from s to the complement of T, when playing with policy σ, is strictly positive. Now,

    Q^π(s, π) = V^π(s) = V^σ(s) + x
              = Σ_{a∈A} σ(s, a) ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} ( V^σ(s') + x ) )
              > Σ_{a∈A} σ(s, a) ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V^π(s') )
              = Σ_{a∈A} σ(s, a) Q^π(s, a) = Q^π(s, σ).

The inequality is strict because V^σ(s') + x ≥ V^π(s') for every state s', with strict inequality for s' ∉ T, and under σ the state s has a positive probability of moving to the complement of T. Hence Q^π(s, π) > Q^π(s, σ), so π must not satisfy (10).

Theorem 9 (Existence of pure optimal policies). For every MDP, there is a pure policy which is optimal.

Proof. There are only finitely many pure policies, so among all pure policies there is at least one policy π which minimizes the sum Σ_{s∈S} V^π(s). We claim that this policy π is an optimal policy. Indeed, if π is not an optimal policy then by Theorem 8 there is a state-action pair (s, a) such that π(s, a) > 0 but a ∉ arg min_{b∈A} Q^π(s, b). Then, by Theorem 7, if we modify π to a new policy π' by changing π(s) from a to any action a' ∈ arg min_{b∈A} Q^π(s, b), this new policy π' satisfies V^{π'}(s') ≤ V^π(s') for all states s' ∈ S, with strict inequality for at least one such state. This contradicts our assumption that π minimizes Σ_{s∈S} V^π(s) among all pure policies.

2.5  Computing optimal policies

Theorem 8 actually implies an algorithm for computing an optimal policy of an MDP in polynomial time, by solving a linear program. Namely, consider the linear program:

    max  Σ_{s∈S} V(s)
    s.t. V(s) = 0                                             for s ∈ U
         V(s) ≤ c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s')           for s ∈ S \ U, a ∈ A.

It is easy to check the following facts.

1. If V is a solution of the linear program, then V(s) = min_{a∈A} ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s') ) for all s ∈ S \ U.
2. If π is a pure policy obtained by selecting π(s) ∈ arg min_{a∈A} ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s') ) for all s ∈ S \ U, then V is the value function of π.
3. The pure policy π defined in this manner satisfies the Bellman optimality condition, and is therefore an optimal policy.

In practice, there are iterative algorithms for solving MDP's which are much more efficient than the reduction from MDP's to linear programming presented here. For more information on these other methods for solving MDP's, we refer the reader to sutton/book/ebook/the-book.html.
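
The linear program above can be handed to any LP solver. The following sketch (added for illustration, on a tiny made-up three-state example with U = {2}) builds exactly the listed constraints and solves them with scipy.optimize.linprog; since linprog minimizes, the objective Σ_s V(s) is negated.

```python
import numpy as np
from scipy.optimize import linprog

# A tiny made-up MDP: states {0, 1, 2}, stopping set U = {2}, actions {0, 1}.
P = np.array([[[0.0, 0.5, 0.5],     # action 0 from state 0
               [0.5, 0.0, 0.5],     # action 0 from state 1
               [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0],     # action 1 jumps straight to the stopping state
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0]]])   # P[a, s, s']
c = np.array([[1.0, 2.0],           # c[s, a]
              [1.0, 3.0],
              [0.0, 0.0]])
U = {2}
S, A = range(3), range(2)

obj = -np.ones(3)                   # maximize sum_s V(s)  <=>  minimize -sum_s V(s)
A_ub, b_ub = [], []
for s in S:
    if s in U:
        continue
    for a in A:
        row = -P[a, s].copy()       # V(s) - sum_s' P^a_{s,s'} V(s') <= c(s, a)
        row[s] += 1.0
        A_ub.append(row)
        b_ub.append(c[s, a])
A_eq = [[1.0 if t == s else 0.0 for t in S] for s in U]   # V(s) = 0 for s in U
b_eq = [0.0] * len(U)

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * 3)
print(res.x)                        # the optimal value function; here it should be close to [2, 2, 0]
```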
