Animal learning theory Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001] Bert Kappen
[Sutton and Barto, 1990]
Classical conditioning:
- A conditioned stimulus (CS) and an unconditioned stimulus (US) are paired in close succession.
- The US produces an unconditioned response (UR).
- After sufficient pairing, the CS alone produces a conditioned response (CR) similar to the UR.
Example: rabbit eye-blink conditioning
- The CS is the sound of a buzzer, the US is an air puff to the rabbit's eye, the UR is an eye blink.
- After CS-US pairing, the buzzer alone yields the CR (eye blink).
[Sutton and Barto, 1990]
Classical theory proposes that a causal relation between CS and US is learned.
$$\Delta V = (\text{level of US processing}) \times (\text{level of CS processing}) = \text{Reinforcement} \times \text{Eligibility}$$
$V$ is the association strength.
- US processing is about reinforcement (positive or negative).
- CS processing is about attention (which stimulus is credited).
We focus on reinforcement processing with very simple eligibility models.
Trial-level vs. real-time theories [Sutton and Barto, 1990]
Trial-level theories:
- ignore the temporal aspects within an individual trial; updates are applied at the end of the trial
- Rescorla-Wagner model
- mostly used in experiments
Real-time theories:
- consider the detailed timing of stimuli and rewards
- TD learning
Rescorla-Wagner (RW) model [Sutton and Barto, 1990]
The central idea is that learning occurs whenever events violate expectations, in particular when the expected US level $\bar V$ differs from the actual US level $\lambda$:
$$V_i := V_i + \Delta V_i, \qquad \Delta V_i = \underbrace{\beta(\lambda - \bar V)}_{\text{reinforcement}}\;\underbrace{\alpha_i X_i}_{\text{saliency}}, \qquad \bar V = \sum_i V_i X_i$$
- $X_i = 0, 1$ indicates the absence/presence of CS $i$.
- $\alpha_i$ is the (fixed) saliency (attention) of CS $i$. The term $\alpha_i X_i$ is often ignored for a single stimulus.
- $V_i$ is the association strength of stimulus CS $i$ to the US.
- $\lambda$ denotes the level of the US, combining, in an unspecified way, the US's intensity, duration, and temporal relationship with the CS (for instance, the strength and duration of the air puff and its timing relative to the CS).
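The RW update can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `rescorla_wagner`, the default `beta = 0.1`, and the default saliencies `alpha_i = 1` are assumptions chosen for the example.

```python
def rescorla_wagner(trials, beta=0.1, alpha=None):
    """Apply the RW rule over a list of (X, lam) trials.

    X is a list of 0/1 flags (CS i absent/present), lam is the US level.
    Returns the final association strengths V_i.
    """
    n = len(trials[0][0])
    if alpha is None:
        alpha = [1.0] * n                 # assumed default saliencies
    V = [0.0] * n
    for X, lam in trials:
        V_bar = sum(v * x for v, x in zip(V, X))   # expected US level
        reinforcement = beta * (lam - V_bar)        # beta * (lambda - V_bar)
        for i in range(n):
            V[i] += reinforcement * alpha[i] * X[i]
    return V


# Acquisition: one CS paired with a US of level 1 for 100 trials.
V = rescorla_wagner([([1], 1.0)] * 100)
print(V)
```

With a single CS the prediction error shrinks by a factor $(1-\beta)$ per trial, so $V$ approaches $\lambda$ exponentially.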
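RW is linear regression [Sutton and Barto, 1990]
Consider the sequence $\{(X_{i,t}, \lambda_t),\ t = 1, \ldots, T\}$. Define the error
$$C = \frac{1}{2} \sum_{t=1}^{T} \Big( \lambda_t - \sum_i X_{i,t} V_i \Big)^2 = \sum_{t=1}^{T} C_t$$
Minimizing $C$ with respect to $V_i$ gives the best linear predictive model. Gradient descent applied one trial at a time is identical to the RW rule with $\alpha_i = 1$:
$$\Delta V_i = -\beta \frac{\partial C_t}{\partial V_i} = \beta \Big( \lambda_t - \sum_j X_{j,t} V_j \Big) X_{i,t}$$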
Rescorla-Wagner model [Sutton and Barto, 1990]
The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
- Blocking: a learned association $s_1 \to r$ prevents learning of the association $s_2 \to r$.
- Inhibition: $s_2$ reduces the prediction when combined with any predicting stimulus.
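Both phenomena can be reproduced with a small simulation. The sketch below assumes two stimuli, saliencies $\alpha_i = 1$, a learning rate of 0.2, and trial schedules picked for illustration; none of these values come from the slides.

```python
def rw_trial(V, X, lam, beta=0.2):
    """One RW update with all saliencies alpha_i = 1 (an assumption)."""
    V_bar = sum(v * x for v, x in zip(V, X))
    return [v + beta * (lam - V_bar) * x for v, x in zip(V, X)]


def run(trials, V=None):
    """Apply RW updates for a list of (X, lam) trials, starting from V."""
    if V is None:
        V = [0.0, 0.0]
    for X, lam in trials:
        V = rw_trial(V, X, lam)
    return V


def blocking():
    V = run([([1, 0], 1.0)] * 100)           # phase 1: s1 -> r
    return run([([1, 1], 1.0)] * 100, V)     # phase 2: s1 + s2 -> r


def control():
    return run([([1, 1], 1.0)] * 100)        # compound only, no pretraining


def inhibition():
    # s1 alone is rewarded, the compound s1 + s2 is not (interleaved).
    return run([([1, 0], 1.0), ([1, 1], 0.0)] * 200)


print("blocking  ", blocking())
print("control   ", control())
print("inhibition", inhibition())
```

In the blocking run $s_1$ already predicts the reward, so the error in phase 2 is near zero and $V_2$ stays near zero. In the control run both stimuli share the prediction. In the inhibition run $V_2$ becomes negative.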
Rescorla-Wagner model [Sutton and Barto, 1990]
RW fails to explain second-order conditioning: B → A → US
- Empirical finding: if A predicts the US and B predicts A, then B predicts the US.
- The A → US relation is correctly reinforced by RW.
- The B → A relation is not reinforced by RW, because the US level $\lambda = 0$ in B → A trials.
A solution is to assume that A also generates a $\lambda$.
Temporal aspects play an important role [Sutton and Barto, 1990]
Real-time theories of eligibility [Sutton and Barto, 1990]
Conditioning depends on the time interval between CS and US. Mechanisms for the coincidence of CS and US:
- Stimulus trace (Hull 1939): an internal representation of the stimulus. The disadvantage of using a stimulus trace for learning is that long inter-stimulus intervals (ISIs) require a broad stimulus trace, which disagrees with fast, precisely timed responses.
- Eligibility trace X (Klopf 1972): similar to the stimulus trace, but used only for learning.
From RW to a real-time theory [Sutton and Barto, 1990]
Left: trial-based reinforcement $\lambda$ is the area under the curve. Right: real-time reinforcement is the future area under the curve.
Problems:
- The prediction is constant for all times prior to the US. This disagrees with empirical observation.
The solution is imminence weighting, which reduces long-time associations.
Temporal difference learning [Sutton and Barto, 1990]
At time $t$ the subject predicts the imminence-weighted area (c) rather than the unweighted area (a), which may be infinite.
Temporal difference learning [Dayan and Abbott 9.2]
Consider the time interval $0 \le t < \infty$ with stimulus $u(t)$ and reward $\lambda(t)$. The expected (discounted = imminence-weighted) future reward is
$$R(t) = E\Big[ \sum_{\tau=t}^{\infty} \gamma^{\tau - t} \lambda(\tau) \Big]$$
The aim of reinforcement learning is to learn a signal $v(t)$ that predicts the expected future reward $R(t)$ from the past stimulus:
$$v(t) = \sum_{\tau=0}^{T} w(\tau)\, u(t - \tau)$$
with $w(\tau)$ adaptive parameters. Minimizing the quadratic error
$$C = \frac{1}{2} \sum_{t=0}^{\infty} \big( v(t) - R(t) \big)^2$$
with respect to $w(\tau)$ yields
$$\Delta w(\tau) = -\epsilon \frac{\partial C}{\partial w(\tau)} = \epsilon \sum_{t=0}^{\infty} \big( R(t) - v(t) \big)\, u(t - \tau), \qquad \tau = 0, \ldots, T$$
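Temporal difference learning [Dayan and Abbott 9.2]
The problem is that we do not know $R(t)$. But we can use the recursive relation
$$R(t) = E\Big[ \sum_{\tau=t}^{\infty} \gamma^{\tau - t} \lambda(\tau) \Big] = \lambda(t) + \gamma E\Big[ \sum_{\tau=t+1}^{\infty} \gamma^{\tau - t - 1} \lambda(\tau) \Big] = \lambda(t) + \gamma R(t+1) \approx \lambda(t) + \gamma v(t+1)$$
where we approximate $R(t+1)$ by $v(t+1)$. The update at time $t$ becomes
$$\Delta w(\tau) = \epsilon \big( R(t) - v(t) \big)\, u(t - \tau) \approx \epsilon \big( \lambda(t) + \gamma v(t+1) - v(t) \big)\, u(t - \tau)$$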
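Temporal difference learning [Dayan and Abbott 9.2]
When $v(t)$ does not depend on the stimulus, but is simply a parameter that is estimated for each $t$:
$$C = \frac{1}{2} \sum_{t=0}^{\infty} \big( R(t) - v(t) \big)^2$$
$$\Delta v(t) = -\epsilon \frac{\partial C}{\partial v(t)} = \epsilon \big( R(t) - v(t) \big) \approx \epsilon \big( r(t) + \gamma v(t+1) - v(t) \big)$$
Here $r(t)$ denotes the reward at time $t$, written $\lambda(t)$ in the Sutton and Barto notation.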
Inter-stimulus-interval dependency [Dayan and Abbott 9.2]
The rabbit CR (eye blink) to the CS (sound), as a function of the time interval between CS and US (air puff), shows good agreement with the TD model.
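Fig. 9.2 [Dayan and Abbott 9.2]
The stimulus is $u(t) = \delta_{t,100}$, thus by Eq. 9.6
$$v(t) = \sum_{\tau=0}^{T} w(\tau)\, u(t - \tau) = \begin{cases} w(t - 100) & t \ge 100 \\ 0 & 0 \le t < 100 \end{cases}$$
Denote $\tau = t - 100$ with $t \ge 100$, then $v(t) = w(\tau)$. The delta rule, Eq. 9.10, gives
$$\Delta w(\tau) = \epsilon\, \delta(t)\, u(t - \tau) = \epsilon\, \delta(t)\, \delta_{t - \tau, 100}$$
$$\Delta v(t) = \Delta w(\tau) = \epsilon\, \delta(t), \qquad \delta(t) = r(t) + v(t+1) - v(t)$$
and $\Delta w(\tau) = 0$ if $\tau \ne t - 100$. The reward is $r(t) = \delta_{t,200}$.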
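[Dayan and Abbott 9.2]
First iteration:
$$v(t) = 0, \qquad \delta(t) = \delta_{t,200}$$
Second iteration:
$$v(t) = \epsilon\, \delta_{t,200}, \qquad \delta(t) = (1 - \epsilon)\, \delta_{t,200} + \epsilon\, \delta_{t,199}$$
Third iteration:
$$v(t) = (2\epsilon - \epsilon^2)\, \delta_{t,200} + \epsilon^2\, \delta_{t,199}, \qquad \delta(t) = \ldots$$
In each iteration $\delta(t) = \Delta v(t) + r(t)$, where $\Delta v(t) = v(t+1) - v(t)$ denotes the temporal difference of $v$ as plotted in Fig. 9.2 (not the weight update of the previous slide).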
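The first iterations can be checked numerically. The sketch below assumes $\gamma = 1$, a trial of 251 time steps, and that all weight updates are applied at the end of each trial using the $\delta(t)$ computed from the weights at the start of the trial; the function name and the value $\epsilon = 0.5$ are chosen for illustration only.

```python
def run_td_trials(n_trials, eps=0.5, T=250, t_stim=100, t_reward=200):
    """Return v(t), t = 0..T, after n_trials of TD learning (gamma = 1)."""
    u = [1.0 if t == t_stim else 0.0 for t in range(T + 1)]    # stimulus
    r = [1.0 if t == t_reward else 0.0 for t in range(T + 1)]  # reward
    w = [0.0] * (T + 1)

    def predict(t):
        # v(t) = sum_tau w(tau) u(t - tau); v is taken as 0 beyond the trial.
        if t > T:
            return 0.0
        return sum(w[tau] * u[t - tau] for tau in range(t + 1))

    for _ in range(n_trials):
        v = [predict(t) for t in range(T + 2)]
        delta = [r[t] + v[t + 1] - v[t] for t in range(T + 1)]
        for t in range(T + 1):
            for tau in range(t + 1):
                w[tau] += eps * delta[t] * u[t - tau]
    return [predict(t) for t in range(T + 1)]


v1 = run_td_trials(1)
v2 = run_td_trials(2)
print(v1[200], v2[200], v2[199])
```

With $\epsilon = 0.5$ this gives $v(200) = \epsilon$ after one trial, and $v(200) = 2\epsilon - \epsilon^2$, $v(199) = \epsilon^2$ after two trials. The prediction spreads backwards in time by one step per trial.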
Dopamine [Dayan and Abbott 9.2]
The activity of dopamine neurons in the ventral tegmental area (VTA) of the monkey encodes $\delta$ as in Fig. 9.2. The monkey presses a button after a stimulus (sound) to receive a reward (fruit juice).
- A: Left panels show trials locked to the stimulus, right panels show trials locked to the reward. The top row shows early trials, the bottom row late trials. VTA cells respond to the reward in early trials and to the stimulus in late trials.
- B top: stimulus-locked trials after learning, as in the bottom row of A.
- B bottom: withholding the (expected) reward yields inhibition.
Instrumental conditioning / static action choice [Dayan and Abbott 9.3]
Conditioning:
- Classical: reward (or punishment) is independent of the action taken (Pavlov's dog).
- Instrumental: reward depends on the action taken.
Reward timing:
- Immediate: reward follows directly after the action (static action choice).
- Delayed: reward follows after a sequence of actions.
Foraging of bees [Dayan and Abbott 9.3]
Foraging of bees as an example of learning behavior with immediate reward:
- Bees visit flowers of two colors (blue, yellow), preferring the color with the higher reward (nectar).
- When the rewards are swapped, the bees adjust their preference.
The quantity of nectar $r_{b,y}$ is stochastic, drawn from a (fixed) distribution $q_{b,y}(r)$. This is like a two-armed bandit problem.
The policy of the bees is given by probabilities $p_{b,y}$ with $p_b + p_y = 1$. We can parametrize this as
$$p_b = \frac{e^{\beta m_b}}{e^{\beta m_b} + e^{\beta m_y}}, \qquad p_y = \frac{e^{\beta m_y}}{e^{\beta m_b} + e^{\beta m_y}}$$
The parameters $m_{b,y}$ are called action values; they do not require normalization. The parameter $\beta$ controls the exploration-exploitation trade-off: $\beta = 0$ gives pure exploration, $\beta = \infty$ pure exploitation.
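The softmax policy is easy to compute directly. The helper below is a minimal sketch; the name `softmax_policy` and the dictionary representation of the action values are assumptions made for the example. Subtracting the maximum action value leaves the probabilities unchanged and avoids overflow for large $\beta$.

```python
import math


def softmax_policy(m, beta):
    """Return p_a = exp(beta * m_a) / sum_a' exp(beta * m_a')."""
    m_max = max(m.values())                   # shift for numerical stability
    z = {a: math.exp(beta * (v - m_max)) for a, v in m.items()}
    total = sum(z.values())
    return {a: z[a] / total for a in z}


m = {"b": 1.0, "y": 2.0}
print(softmax_policy(m, beta=0.0))    # pure exploration: equal probabilities
print(softmax_policy(m, beta=1.0))
print(softmax_policy(m, beta=50.0))   # near-deterministic choice of yellow
```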
Foraging of bees [Dayan and Abbott 9.3]
Two learning strategies:
- Learn the action value as the past expected reward (indirect actor).
- Learn the action value so as to maximize the future expected reward (direct actor).
Indirect actor [Dayan and Abbott 9.3]
We need a learning rule to estimate
$$m_b = \langle r_b \rangle = \sum_{r_b} q_b(r_b)\, r_b, \qquad m_y = \langle r_y \rangle = \sum_{r_y} q_y(r_y)\, r_y$$
Consider the cost function
$$E(m_b) = \frac{1}{2} \sum_{r_b} q_b(r_b)\, (m_b - r_b)^2$$
The function is minimized when
$$\frac{dE}{dm_b} = \sum_{r_b} q_b(r_b)\, (m_b - r_b) = m_b - \langle r_b \rangle = 0$$
We use stochastic gradient descent, replacing $\langle r_b \rangle$ by a single sample $r_b$:
$$m_b := m_b - \epsilon \frac{dE}{dm_b} = m_b + \epsilon\, (r_b - m_b)$$
and similarly for $m_y$: $m_y := m_y + \epsilon\, (r_y - m_y)$. This is the delta rule.
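A small simulation shows the delta rule tracking the mean rewards. This is a sketch under assumed settings: nectar volumes of 0 or 2 (blue) and 0 or 4 (yellow) with equal probability, $\epsilon = 0.01$, $\beta = 1$, and a fixed random seed. The function names are hypothetical.

```python
import math
import random


def delta_rule(m, r, eps):
    """One stochastic gradient step: m := m + eps * (r - m)."""
    return m + eps * (r - m)


def simulate_indirect_actor(n_visits=5000, eps=0.01, beta=1.0, seed=0):
    rng = random.Random(seed)
    rewards = {"b": [0.0, 2.0], "y": [0.0, 4.0]}   # assumed nectar volumes
    m = {"b": 0.0, "y": 0.0}
    for _ in range(n_visits):
        # Softmax choice between the two flowers.
        p_b = 1.0 / (1.0 + math.exp(-beta * (m["b"] - m["y"])))
        flower = "b" if rng.random() < p_b else "y"
        r = rng.choice(rewards[flower])
        # Only the action value of the visited flower is updated.
        m[flower] = delta_rule(m[flower], r, eps)
    return m


print(simulate_indirect_actor())
```

The estimates fluctuate around $\langle r_b \rangle = 1$ and $\langle r_y \rangle = 2$. A smaller $\epsilon$ reduces the fluctuations but slows adaptation when the rewards are swapped.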
Indirect actor [Dayan and Abbott 9.3]
$\langle r_b \rangle = 1$, $\langle r_y \rangle = 2$ for the first 100 visits, with $q_b(r) = \frac{1}{2}\delta_{r,0} + \frac{1}{2}\delta_{r,2}$ and $q_y(r) = \frac{1}{2}\delta_{r,0} + \frac{1}{2}\delta_{r,4}$. The rewards are reversed for the second 100 visits.
- A) Visits are selected according to the action values with $\beta = 1$, and $m_{b,y}$ are learned with the delta rule.
- B) Cumulative reward for $\beta = 1$.
- C, D) The same for $\beta = 50$: large $\beta$ gives better exploitation but (sometimes) worse exploration.
This behavior is seen in real bees.
Indirect actor [Dayan and Abbott 9.3]
Real bees are risk averse. Blue flowers provide 2 µl of nectar; yellow flowers provide 6 µl on 1/3 of the trials and zero otherwise. After 15 trials the rewards are reversed.
- A) Mean preference of 5 bees for blue flowers over 30 trials, each consisting of 40 visits.
- B) If the action value is a concave function of nectar volume, the low-risk option is preferred.
- C) Preference of a single model bee ($\epsilon = 0.3$, $\beta = 23/8$).
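Direct actor [Dayan and Abbott 9.3]
The direct actor method estimates $m_{b,y}$ so as to maximize the expected future reward
$$\langle r \rangle = p_b \langle r_b \rangle + p_y \langle r_y \rangle$$
We use gradient ascent on $\langle r \rangle$:
$$m_b := m_b + \epsilon \beta p_b p_y \big( \langle r_b \rangle - \langle r_y \rangle \big) = m_b + \epsilon \beta p_b (1 - p_b) \langle r_b \rangle - \epsilon \beta p_y p_b \langle r_y \rangle$$
$$m_y := m_y + \epsilon \beta p_b p_y \big( \langle r_y \rangle - \langle r_b \rangle \big) = m_y + \epsilon \beta p_y (1 - p_y) \langle r_y \rangle - \epsilon \beta p_b p_y \langle r_b \rangle$$
where we used
$$\frac{dp_b}{dm_b} = \beta p_b p_y, \qquad \frac{dp_y}{dm_b} = -\beta p_b p_y$$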
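The gradient $\partial \langle r \rangle / \partial m_b = \beta p_b p_y (\langle r_b \rangle - \langle r_y \rangle)$ can be checked against a finite difference. The sketch below uses hypothetical function names and arbitrary test values for $m_b$, $m_y$, $\beta$ and the mean rewards.

```python
import math


def p_blue(m_b, m_y, beta):
    """Softmax probability of choosing blue."""
    return 1.0 / (1.0 + math.exp(-beta * (m_b - m_y)))


def expected_reward(m_b, m_y, beta, r_b, r_y):
    p_b = p_blue(m_b, m_y, beta)
    return p_b * r_b + (1.0 - p_b) * r_y


def analytic_gradient_b(m_b, m_y, beta, r_b, r_y):
    p_b = p_blue(m_b, m_y, beta)
    return beta * p_b * (1.0 - p_b) * (r_b - r_y)


def numeric_gradient_b(m_b, m_y, beta, r_b, r_y, h=1e-6):
    # Central finite difference in m_b.
    up = expected_reward(m_b + h, m_y, beta, r_b, r_y)
    down = expected_reward(m_b - h, m_y, beta, r_b, r_y)
    return (up - down) / (2.0 * h)


args = (0.3, -0.2, 2.0, 1.0, 2.0)   # m_b, m_y, beta, <r_b>, <r_y>
print(analytic_gradient_b(*args), numeric_gradient_b(*args))
```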
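Direct actor [Dayan and Abbott 9.3]
The stochastic version (with the factor $\beta$ absorbed into $\epsilon$) is
$$\delta m_b = \epsilon (1 - p_b)\, r_b, \qquad \delta m_y = -\epsilon\, p_y\, r_b \qquad \text{if } b \text{ is selected}$$
$$\delta m_b = -\epsilon\, p_b\, r_y, \qquad \delta m_y = \epsilon (1 - p_y)\, r_y \qquad \text{if } y \text{ is selected}$$
For multiple actions the rule generalizes to, when action $a$ is taken:
$$\delta m_{a'} = \epsilon\, (\delta_{a,a'} - p_{a'})\, r_a \qquad \forall a'$$
(For instance, taking $a = b$ and $a' = b, y$ gives the first line above.) We will use this in the actor-critic algorithm (DA 9.4).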
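The general rule $\delta m_{a'} = \epsilon (\delta_{a,a'} - p_{a'}) r_a$ translates into one line per action. The sketch below is illustrative; the function names and the defaults $\epsilon = 0.1$, $\beta = 1$ are assumptions.

```python
import math


def softmax(m, beta):
    m_max = max(m.values())
    z = {a: math.exp(beta * (v - m_max)) for a, v in m.items()}
    total = sum(z.values())
    return {a: z[a] / total for a in z}


def direct_actor_update(m, chosen, reward, eps=0.1, beta=1.0):
    """Return updated action values after taking `chosen` and receiving `reward`."""
    p = softmax(m, beta)
    return {
        a: m[a] + eps * ((1.0 if a == chosen else 0.0) - p[a]) * reward
        for a in m
    }


# Blue is selected and yields a reward of 2; both action values change.
print(direct_actor_update({"b": 0.0, "y": 0.0}, chosen="b", reward=2.0))
```

Because the probabilities sum to one, the updates over all actions sum to zero: a reward for the chosen action raises its value and lowers the others.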
Direct actor [Dayan and Abbott 9.3]
Setting as before; two different runs.
- A, B) Successful learning.
- C, D) Failure to learn to switch, due to a larger difference $m_y - m_b$.
This illustrates that learning methods that directly optimize the expected future reward may be too greedy and explore poorly.
Delayed reward / sequential decision making [Dayan and Abbott 9.4]
The reward is obtained after a sequence of actions. The rat moves through the maze without backtracking. After the reward, the rat is removed from the maze and restarted.
Delayed reward problem: the choice at A has no direct reward.
Delayed reward / sequential decision making [Dayan and Abbott 9.4]
Policy iteration (see [Kaelbling et al., 1996] 3.2.2):
Loop:
- Policy evaluation: compute the value $V^\pi$ for policy $\pi$. Run the Bellman backup until convergence.
- Policy improvement: improve $\pi$.
Delayed reward / sequential decision making [Dayan and Abbott 9.4]
Actor-critic (see [Kaelbling et al., 1996] 4.1):
Loop:
- Critic: use TD learning to evaluate V(state) under the current policy.
- Actor: improve the policy p(state).
Policy evaluation with TD [Dayan and Abbott 9.4]
The policy is a random left/right choice at each turn.
$$v(B) = \tfrac{1}{2}(0 + 5) = 2.5, \qquad v(C) = \tfrac{1}{2}(0 + 2) = 1, \qquad v(A) = \tfrac{1}{2}\big( v(B) + v(C) \big) = 1.75$$
Learn these values through TD learning ($v(s) = w(s)$, $s = A, B, C$):
$$v(s) := v(s) + \epsilon\, \delta, \qquad \delta = r(s) + v(s') - v(s)$$
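TD policy evaluation for the random policy can be simulated as follows. This is a sketch with several assumptions: the assignment of rewards to left and right turns in the `MAZE` table is chosen for illustration, $\epsilon = 0.01$, the value after the end of a trial is taken as zero, and the random seed is fixed.

```python
import random

# (state, action) -> (reward, next state); None marks the end of a trial.
# The left/right layout is an assumption made for this example.
MAZE = {
    ("A", "left"): (0.0, "B"), ("A", "right"): (0.0, "C"),
    ("B", "left"): (0.0, None), ("B", "right"): (5.0, None),
    ("C", "left"): (2.0, None), ("C", "right"): (0.0, None),
}


def td_policy_evaluation(n_trials=5000, eps=0.01, seed=0):
    rng = random.Random(seed)
    v = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(n_trials):
        s = "A"
        while s is not None:
            a = rng.choice(["left", "right"])        # random policy
            r, s_next = MAZE[(s, a)]
            v_next = v[s_next] if s_next is not None else 0.0
            delta = r + v_next - v[s]                # TD error
            v[s] += eps * delta
            s = s_next
    return v


print(td_policy_evaluation())
```

The learned values fluctuate around $v(A) = 1.75$, $v(B) = 2.5$, $v(C) = 1$; the fluctuations shrink with smaller $\epsilon$.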
Policy improvement [Dayan and Abbott 9.4]
We use the direct actor method, which updates all $m_{a'}$ upon taking action $a$ as
$$m_{a'} := m_{a'} + \epsilon\, (\delta_{a,a'} - p_{a'})\, r_a \qquad \forall a'$$
In the maze, the actions (and action values) depend on the state and the reward is delayed, so we replace $r_a \to r_a + v(s') - v(s)$:
$$m_{a'}(s) := m_{a'}(s) + \epsilon\, \big( \delta_{a,a'} - p_{a'}(s) \big)\, \delta, \qquad \delta = r_a + v(s') - v(s)$$
$$p_a(s) = \frac{e^{\beta m_a(s)}}{\sum_{a'} e^{\beta m_{a'}(s)}}$$
Example: the values of the random policy are $v(A) = 1.75$, $v(B) = 2.5$, $v(C) = 1$.
[Dayan and Abbott 9.4]
When in state A:
$$\delta = 0 + v(B) - v(A) = 0.75 \quad \text{for a left turn (to B)}$$
$$\delta = 0 + v(C) - v(A) = -0.75 \quad \text{for a right turn (to C)}$$
The learning rule increases $m_{\text{left}}(A)$ and decreases $m_{\text{right}}(A)$. $\beta$ controls the exploration.
Actor-critic learning: the figure shows the probability of a left turn versus trials. Convergence at C is slow because the right arm is visited less often.
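The actor and the critic can be combined in a short simulation. This is a sketch, not the lecture's code: it assumes that a left turn at A leads to B and a right turn to C, a reward layout chosen for illustration, $\epsilon = 0.5$, $\beta = 1$, and a fixed random seed. All names are hypothetical.

```python
import math
import random

ACTIONS = ["left", "right"]
# (state, action) -> (reward, next state); the layout is an assumption.
MAZE = {
    ("A", "left"): (0.0, "B"), ("A", "right"): (0.0, "C"),
    ("B", "left"): (0.0, None), ("B", "right"): (5.0, None),
    ("C", "left"): (2.0, None), ("C", "right"): (0.0, None),
}


def policy(m_s, beta):
    z = {a: math.exp(beta * m_s[a]) for a in ACTIONS}
    total = sum(z.values())
    return {a: z[a] / total for a in ACTIONS}


def actor_critic_step(v, m, s, a, eps=0.5, beta=1.0):
    """Update critic v and actor m in place; return (delta, next state)."""
    r, s_next = MAZE[(s, a)]
    v_next = v[s_next] if s_next is not None else 0.0
    delta = r + v_next - v[s]
    p = policy(m[s], beta)
    for a2 in ACTIONS:                               # actor update
        m[s][a2] += eps * ((1.0 if a2 == a else 0.0) - p[a2]) * delta
    v[s] += eps * delta                              # critic update
    return delta, s_next


def run_actor_critic(n_trials=200, eps=0.5, beta=1.0, seed=0):
    rng = random.Random(seed)
    v = {s: 0.0 for s in "ABC"}
    m = {s: {a: 0.0 for a in ACTIONS} for s in "ABC"}
    for _ in range(n_trials):
        s = "A"
        while s is not None:
            p = policy(m[s], beta)
            a = "left" if rng.random() < p["left"] else "right"
            _, s = actor_critic_step(v, m, s, a, eps, beta)
    return v, m


v, m = run_actor_critic()
print({s: policy(m[s], 1.0)["left"] for s in "ABC"})
```

Starting from the random-policy values, a single left turn at A gives $\delta = 0.75$ and raises $m_{\text{left}}(A)$ by $\epsilon (1 - p_{\text{left}})\, \delta$, as in the worked example above.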
Generalizations [Dayan and Abbott 9.4]
Generalizations of the basic actor-critic model are:
1. The state $s$ may not be directly observable; instead a vector $u_i(s)$ of sensory information is available. The value function $v(s)$ can be a parametrized function of $u_i(s)$. For instance, with a linear parametrization the TD learning rule is
$$v(s) = \sum_i w_i u_i(s), \qquad w_i := w_i + \epsilon\, \delta\, u_i(s), \qquad \delta = r(s) + v(s') - v(s)$$
The action value $m_a(s)$ also becomes a function of $u_i$ instead of $s$ directly:
$$m_a(s) = \sum_i M_{ai}\, u_i(s), \qquad M_{a'i} := M_{a'i} + \epsilon\, \big( \delta_{aa'} - p_{a'}(s) \big)\, \delta\, u_i(s)$$
Three-term learning rule: the connection between neurons $i$ and $a$ is changed depending on a third term $\delta$.
2. Discounted reward. Earlier reward/punishment is more important than later reward/punishment:
$$\delta = r(s) + \gamma v(s') - v(s)$$
3. Updating the value of past states in each learning step: TD($\lambda$).
Water maze [Dayan and Abbott 9.4]
The rat is placed in a large pool of milky water and has to swim around until it finds a small submerged platform. After several trials, the rat learns to locate the platform (A).
- $u_i(s)$ is the activity of the place cell array, $s$ is the physical location (B).
- Critic: $v(s) = \sum_i w_i u_i(s)$ is the value function (C upper).
- Actor: $m_a(s) = \sum_i M_{ai} u_i(s)$ is the action value (C lower).
References
[Dayan and Abbott, 2001] Dayan, P. and Abbott, L. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, New York.
[Kaelbling et al., 1996] Kaelbling, L., Littman, M., and Moore, A. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237-285.
[Sutton and Barto, 1990] Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement, pages 497-537. Cambridge University Press.