Edward L. Thorndike #1874$1949% Lecture 2: Learning from Evaluative Feedback. or!bandit Problems" Learning by Trial$and$Error.

Size: px

Start display at page:

Download "Edward L. Thorndike #1874$1949% Lecture 2: Learning from Evaluative Feedback. or!bandit Problems" Learning by Trial$and$Error."

Oswin Lester
5 years ago
Views:

Lecture 2: Learning from Evaluative Feedback Edward L. Thorndike #1874$1949% or!bandit Problems" Puzzle Box 1 2 Learning by Trial$and$Error Law of E&ect:!

1 Lecture 2: Learning from Evaluative Feedback Edward L. Thorndike #1874$1949% or!bandit Problems" Puzzle Box 1 2 Learning by Trial$and$Error Law of E&ect:!Of several responses to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, othehings being equal, be more 'rmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, othehings being equal, have their connections with that situation weakened, so that when it recurs, they will be less likely to recur." Thorndike, Some Distinctions Evaluating actions vs. instructing by giving correct actions Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken. Supervised learning is instructive; optimization is evaluative Associative vs. Non-associative: Associative: inputs mapped to outputs; learn the best output for each input Non$associative:!learn" #'nd% one best output n$armed bandit #at least how we treat it% is: Non$associative Evaluative feedback 4 4

2 The n$armed Bandit Problem Choose repeatedly from one of n actions; each choice is called a play After each play, you get a reward, where E = Q * These are unknown action values Distribution of depends only on Objective is to Maximize the reward in the long term, e.g., over 1000 plays To solve the n$armed bandit problem, you must explore a variety of actions and the exploit the best of them 5 5 The Exploration/Exploitation Dilemma Suppose you form estimates Q t (a! Q * (a The greedy action at t is: You can(t exploit all the time; you can(t explore all the time You can never stop exploring; but you should always reduce exploring 6 action value estimates a * t = argmax Q t (a a = *! exploitation " *! exploration 6 Action$Value Methods Methods that adapt action$value estimates and nothing else, e.g.: suppose by the t$th play, action a had been chosen k a times, producing rewards r, r, K, r 1 2 k a, then lim Q (a = t k a! " Q* (a Q t (a = r 1 + r 2 + Lr k a k a!sample average"!$greedy Action Selection Greedy action selection:!$greedy: = = a * t = arg max Q t (a a * with probability 1! " random action with probability "... the simplest way to try to balance exploration and exploitation

10$Armed Testbed!$Greedy Methods on the 10 $Armed Testbed n = 10 possible actions Each Q * (a each is chosen randomly from a normal distribution:!(0,1 is also normal:!

3 10$Armed Testbed!$Greedy Methods on the 10 $Armed Testbed n = 10 possible actions Each Q * (a each is chosen randomly from a normal distribution:!(0,1 is also normal:!(q *, plays repeat the whole thing 2000 times and average the results Softmax Action Selection Binary Bandit Tasks Softmax action selection methods grade action probs. by estimated values. Suppose you have just two actions: = 1 or = 2 and just two rewards: = success or = failure The most common softmax uses a Gibbs, or Boltzmann, distribution: Choose action a on play t with probability " e Q t (a! n Qt (b! b=1 e where! is the!computational temperature", Then you might infer arget or desired action: d t = if success the other action if failure and then always play the action that was most often the target Call this the supervised algorithm It works 'ne on deterministic tasks

Contingency Space The space of all possible binary bandit tasks: Linear!Learning Automata" Let! t (a = Pr{ = a} be the only adapted parameter L R I (Linear, reward - inaction On success :! t +1 =!

4 Contingency Space The space of all possible binary bandit tasks: Linear!Learning Automata" Let! t (a = Pr{ = a} be the only adapted parameter L R I (Linear, reward - inaction On success :! t +1 =! t +"(1 #! t 0 < " < 1 (the other action probs. are adjusted to still sum to 1 On failure : no change L R -P (Linear, reward - penalty On success :! t +1 =! t +"(1 #! t 0 < " < 1 (the other action probs. are adjusted to still sum to 1 On failure :! t +1 =! t +"(0 #! t 0 < " < 1 Fowo actions, a stochastic, incremental version of the supervised algorithm Performance on Binary Bandit Tasks A and B Incremental Implementation Recall the sample average estimation method: The average of the 'rst k rewards is #dropping the dependence on a %: Q k = r 1 + r 2 +Lr k k Can we do this incrementally #without storing all the rewards%? We could keep a running sum and count, or, equivalently: Q k +1 = Q k + 1 [ k +1 r! Q k +1 k ] This is a common form for update rules: NewEstimate = OldEstimate + StepSize*Target OldEstimate

Tracking a Non$stationary Problem Optimistic Initial Values Choosing Q k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q * (a change oveime, But not in a non$stationary problem.

[ r k +1 " Q k ] for constant!, 0 <! # 1 = (1"! k Q 0 + $!(1 "!

5 Tracking a Non$stationary Problem Optimistic Initial Values Choosing Q k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q * (a change oveime, But not in a non$stationary problem. All methods so far depend on Q 0 (a, i.e., they are biased. Suppose instead we initialize the action values optimistically, i.e., on the 10$armed testbed, use Q 0 (a = 5 for all a Better in the non$stationary case is: Q k +1 = Q k +! [ r k +1 " Q k ] for constant!, 0 <! # 1 = (1"! k Q 0 + $!(1 "! k "i r i k i =1 exponential, recency-weighted average Reinforcement Comparison Compare rewards to a reference reward, average of observed rewards, e.g., an Performance of a Reinforcement Comparison Method Strengthen or weaken the action taken depending on! Let p t (a denote the preference for action a Preferences determine action probabilities, e.g., by Gibbs distribution:! t (a = Pr{ = a} = " e p t (a n pt (b e b=1 Then: p t +1 = p t (a +! [ ] and +1 = + "[! r ] t

Pursuit Methods Maintain both action$value estimates and action preferences Performance of a Pursuit Method Always!pursue" the greedy action, i.e., make the greedy action more likely to be selected Aftehe t$th play, update the action values to get Q t +1 The new greedy action is * +1 = argmax Q t +1 (a a Then: [ ] *!

6 Pursuit Methods Maintain both action$value estimates and action preferences Performance of a Pursuit Method Always!pursue" the greedy action, i.e., make the greedy action more likely to be selected Aftehe t$th play, update the action values to get Q t +1 The new greedy action is * +1 = argmax Q t +1 (a a Then: [ ] *! t =! t (a * * t +1 + " 1 #! t +1 and the probs. of the other actions decremented to maintain the sum of Associative Search Conclusions These are all very simple methods Imagine switching bandits at each play but they are complicated enough,we will build on them Bandit 3 Ideas for improvements: estimating uncertainties... interval estimation actions approximating Bayes optimal solutions Gittens indices The full RL problem o&ers some ideas for solution

7 Next Class Read Chapter 3 25

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems

Lecture 2: Learning from Evaluative Feedback or Bandit Problems 1 Edward L. Thorndike (1874-1949) Puzzle Box 2 Learning by Trial-and-Error Law of Effect: Of several responses to the same situation, those