Reinforcement Learning for NLP

1 Reinforcement Learning for NLP
Advanced Machine Learning for NLP
Jordan Boyd-Graber
REINFORCEMENT OVERVIEW, POLICY GRADIENT
Adapted from slides by David Silver, Pieter Abbeel, and John Schulman
Advanced Machine Learning for NLP Boyd-Graber Reinforcement Learning for NLP 1 of 1

3 I used to say that RL wasn't used in NLP...
- Now it's all over the place
- Part of much of ML hype
But what is reinforcement learning?
- RL is a general-purpose framework for decision-making
- RL is for an agent with the capacity to act
- Each action influences the agent's future state
- Success is measured by a scalar reward signal
- Goal: select actions to maximise future reward

4 At each step $t$ the agent:
- Executes action $a_t$
- Receives observation $o_t$
- Receives scalar reward $r_t$
The environment:
- Receives action $a_t$
- Emits observation $o_{t+1}$
- Emits scalar reward $r_{t+1}$
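The interaction loop above can be sketched in a few lines of Python. This is only an illustration: `CoinGuessEnv`, `run_episode`, and the `reset`/`step` method names are hypothetical and not taken from the slides. The toy environment reveals a coin side as its observation and pays reward 1 when the action matches it.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment: the observation is a coin side (0 or 1)."""

    def __init__(self, horizon=5, seed=0):
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0
        self.coin = None

    def reset(self):
        self.t = 0
        self.coin = self.rng.choice([0, 1])
        return self.coin  # first observation

    def step(self, action):
        # The environment receives a_t, then emits r_{t+1} and o_{t+1}.
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.choice([0, 1])
        done = self.t >= self.horizon
        return self.coin, reward, done


def run_episode(env, policy):
    """Roll out one episode; return experience as (observation, action, reward)."""
    experience = []
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)                        # agent executes a_t
        next_obs, reward, done = env.step(action)   # environment responds
        experience.append((obs, action, reward))
        obs = next_obs
    return experience


# A policy that simply echoes the observation is always right in this toy task.
episode = run_episode(CoinGuessEnv(), policy=lambda o: o)
total_reward = sum(r for _, _, r in episode)
```

Because the observation fully determines the best action here, this toy task is the fully observed case; a partially observed task would hide the coin.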

5
Example    QA               MT
State      Words Seen       Foreign Words Seen
Reward     Answer Accuracy  Translation Quality
Actions    Answer / Wait    Translate / Wait

6 State
- Experience is a sequence of observations, actions, rewards:
$$ o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t \tag{1} $$
- The state is a summary of experience:
$$ s_t = f(o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t) \tag{2} $$
- In a fully observed environment:
$$ s_t = f(o_t) \tag{3} $$

7 What makes an RL agent?
- Policy: agent's behaviour function
- Value function: how good is each state and/or action
- Model: agent's representation of the environment

8 Policy
- A policy is the agent's behavior
- It is a map from state to action:
  - Deterministic policy: $a = \pi(s)$
  - Stochastic policy: $\pi(a \mid s) = p(a \mid s)$
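The two kinds of policy can be contrasted in a short sketch. Everything here is illustrative: the state is assumed to be the number of words seen so far (borrowing the QA example), and the thresholds and probabilities are made up rather than taken from the slides.

```python
import random


def deterministic_policy(state):
    """a = pi(s): each state maps to exactly one action (hypothetical rule)."""
    return "answer" if state >= 3 else "wait"


def stochastic_policy_probs(state):
    """pi(a | s): a probability for every action (illustrative values)."""
    p_answer = min(1.0, 0.2 * state)
    return {"answer": p_answer, "wait": 1.0 - p_answer}


def sample_action(probs, rng):
    """Draw one action according to pi(a | s)."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
probs = stochastic_policy_probs(2)   # after two words: mostly wait
action = sample_action(probs, rng)
```

The deterministic policy returns the same action every time for a given state, while repeated calls to `sample_action` with the same `probs` can return different actions.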

9 Value Function
- A value function is a prediction of future reward: how much reward will I get from action $a$ in state $s$?
- The Q-value function gives the expected total reward from state $s$ and action $a$ under policy $\pi$ with discount factor $\gamma$ (future rewards mean less than immediate ones):
$$ Q^{\pi}(s,a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots \mid s, a \right] \tag{4} $$
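As a small worked example of the discounted sum inside the expectation, the sketch below computes the return of one reward sequence and then averages returns over sampled sequences as a simple Monte Carlo stand-in for the expectation. The function names and reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one reward sequence."""
    total = 0.0
    # Work backwards so each step is total = r + gamma * (return of the rest).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_q(sampled_reward_sequences, gamma):
    """Average the returns of sequences that all start from the same (s, a)."""
    returns = [discounted_return(rs, gamma) for rs in sampled_reward_sequences]
    return sum(returns) / len(returns)


g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)   # 1 + 0.5*0 + 0.25*2 = 1.5
q_hat = monte_carlo_q([[1.0, 0.0, 2.0], [0.0, 0.0, 0.0]], gamma=0.5)
```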

10 A Value Function is Great!
- An optimal value function is the maximum achievable value:
$$ Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) = Q^{\pi^{*}}(s,a) \tag{5} $$
- If you know the optimal value function, you can derive the optimal policy:
$$ \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a) \tag{6} $$
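Reading a policy off a value function is a one-line argmax. The sketch below assumes a small hand-written Q table; the state names, action names, and values are hypothetical.

```python
# Hypothetical Q-values for a two-state QA-style task.
q_table = {
    ("few_words", "wait"): 0.6,
    ("few_words", "answer"): 0.2,
    ("many_words", "wait"): 0.3,
    ("many_words", "answer"): 0.9,
}


def greedy_policy(q, state):
    """Return argmax_a Q(state, a) over the actions listed for that state."""
    actions = [a for (s, a) in q if s == state]
    return max(actions, key=lambda a: q[(state, a)])


early_action = greedy_policy(q_table, "few_words")
late_action = greedy_policy(q_table, "many_words")
```

With these values the greedy policy waits when few words have been seen and answers once many have.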

11 Approaches to RL
- Value-based RL
  - Estimate the optimal value function $Q^{*}(s,a)$
  - This is the maximum value achievable under any policy
- Policy-based RL
  - Search directly for the optimal policy $\pi^{*}$
  - This is the policy achieving maximum future reward
- Model-based RL
  - Build a model of the environment
  - Plan (e.g. by lookahead) using the model

12 Deep Q Learning
- Optimal Q-values should obey the Bellman equation:
$$ Q^{*}(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a \right] \tag{7} $$
- Treat as a regression problem with network weights $w$. Minimize:
$$ \left( r + \gamma \max_{a'} Q(s',a',w) - Q(s,a,w) \right)^{2} $$
- Converges to $Q^{*}$ using a table lookup representation
- But diverges using neural networks due to:
  - Correlations between samples
  - Non-stationary targets
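For the table lookup case, one regression step toward the target can be sketched as below. This is a minimal tabular illustration, not the deep variant: the function name, step size `alpha`, and the state and action labels are assumptions, and terminal states are not handled.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Move Q(s, a) a fraction alpha toward the target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    td_error = target - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


q = defaultdict(float)            # unseen (state, action) pairs start at 0
actions = ["wait", "answer"]
first_error = q_learning_update(q, "s0", "answer", 1.0, "s1", actions)
second_error = q_learning_update(q, "s0", "answer", 1.0, "s1", actions)
```

Repeating the same transition shrinks the error each time (1.0, then 0.5), which is the squared-error objective being reduced one sample at a time.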

13 Deep RL in Atari
[Figure not preserved]

14 DQN in Atari
- End-to-end learning of values $Q(s,a)$ from pixels $s$
- Input state $s$ is a stack of raw pixels from the last four frames
- Output is $Q(s,a)$ for 18 joystick/button positions
- Reward is the change in score for that step

15 Atari Results
[Figure not preserved]

16 Policy-Based RL
- Advantages:
  - Better convergence properties
  - Effective in high-dimensional or continuous action spaces
  - Can learn stochastic policies
- Disadvantages:
  - Typically converges to a local rather than global optimum
  - Evaluating a policy is typically inefficient and high variance

17-21 Optimal Policies Sometimes Stochastic
[Figure not preserved; the agent cannot distinguish the gray states]
- Deterministic: cannot distinguish gray states. Value-based RL learns a near-deterministic policy!
- Stochastic: cannot distinguish gray states, so flip a coin!

22 Likelihood Ratio Policy Gradient
- Let $\tau$ be a state-action sequence $s_0, u_0, \ldots, s_H, u_H$.
- The utility of policy $\pi$ parametrized by $\theta$ is
$$ U(\theta) = \mathbb{E}\left[ \sum_{t=0}^{H} R(s_t, u_t); \pi_\theta \right] = \sum_{\tau} P(\tau;\theta) R(\tau) \tag{8} $$
- Our goal is to find $\theta$:
$$ \max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} P(\tau;\theta) R(\tau) \tag{9} $$

23-27 Likelihood Ratio Policy Gradient
- Taking the gradient with respect to $\theta$ of
$$ U(\theta) = \sum_{\tau} P(\tau;\theta) R(\tau) \tag{10} $$
- Move differentiation inside the sum (ignore $R(\tau)$, which does not depend on $\theta$) and then add in a term that cancels out:
$$ \nabla_\theta U(\theta) = \sum_{\tau} R(\tau) \frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_\theta P(\tau;\theta) \tag{11} $$
- Move the derivative over the probability:
$$ = \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} R(\tau) \tag{12} $$
- Rewrite the ratio as the gradient of a log (the slides assume a softmax form here):
$$ = \sum_{\tau} P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau) \tag{13} $$
- Approximate with an empirical estimate from $m$ sample paths drawn from $\pi_\theta$:
$$ \nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau_i;\theta) R(\tau_i) \tag{14} $$
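One step the derivation leaves implicit is why the sampled estimate needs no model of the environment. A sketch, assuming the path probability factorises into an initial-state term, policy terms, and dynamics terms (the dynamics notation $P(s_{t+1} \mid s_t, u_t)$ is an assumption introduced here, not taken from the slides): only the policy terms depend on $\theta$, so the others vanish under the gradient.

```latex
\begin{aligned}
P(\tau;\theta) &= P(s_0) \prod_{t=0}^{H} \pi_\theta(u_t \mid s_t) \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, u_t) \\
\log P(\tau;\theta) &= \log P(s_0) + \sum_{t=0}^{H} \log \pi_\theta(u_t \mid s_t) + \sum_{t=0}^{H-1} \log P(s_{t+1} \mid s_t, u_t) \\
\nabla_\theta \log P(\tau;\theta) &= \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t \mid s_t)
\end{aligned}
```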

28 Policy Gradient Intuition
- Increase probability of paths with positive $R$
- Decrease probability of paths with negative $R$
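This intuition can be seen in a tiny sketch of the likelihood ratio update. It assumes the simplest possible setting, which is not from the slides: each "path" is a single action drawn from a softmax policy, and the returns are fixed at +1 and -1. The function names, learning rate, and step count are illustrative.

```python
import math
import random


def softmax(prefs):
    """Turn preferences theta into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Likelihood ratio updates where every path is one action u with return R(u)."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        u = rng.choices(range(len(rewards)), weights=probs, k=1)[0]
        ret = rewards[u]
        for k in range(len(theta)):
            # Gradient of log softmax probability of u with respect to theta_k.
            grad_log = (1.0 if k == u else 0.0) - probs[k]
            theta[k] += lr * grad_log * ret
    return softmax(theta)


# Action 0 has positive return, action 1 has negative return.
final_probs = reinforce_bandit(rewards=[1.0, -1.0])
```

Sampling the positive-return action raises its probability, and sampling the negative-return action lowers its own, so `final_probs` ends up concentrated on action 0.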

29 Extensions
- Consider a baseline $b$ (e.g., path averaging):
$$ \nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau_i;\theta) \left( R(\tau_i) - b \right) \tag{15} $$
- Combine with value estimation (critic)
  - Critic: updates action-value function parameters
  - Actor: updates policy parameters in the direction suggested by the critic
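The effect of a baseline can be checked exactly on a small case. The sketch below assumes a two-action softmax policy with one-step paths and made-up returns of 10 and 11, and uses the average return as a constant baseline; none of these specifics come from the slides. It enumerates both actions to get the exact mean and variance of the single-sample gradient estimate for the first parameter.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def grad_estimator_stats(theta, rewards, baseline):
    """Exact mean and variance of the one-sample estimate of dU/dtheta_0."""
    probs = softmax(theta)
    weighted = []
    for u, p in enumerate(probs):
        grad_log = (1.0 if u == 0 else 0.0) - probs[0]
        weighted.append((p, grad_log * (rewards[u] - baseline)))
    mean = sum(p * g for p, g in weighted)
    var = sum(p * (g - mean) ** 2 for p, g in weighted)
    return mean, var


theta = [0.0, 0.0]
rewards = [10.0, 11.0]
avg_return = sum(p * r for p, r in zip(softmax(theta), rewards))

mean_plain, var_plain = grad_estimator_stats(theta, rewards, baseline=0.0)
mean_base, var_base = grad_estimator_stats(theta, rewards, baseline=avg_return)
```

Both estimators have the same mean, so the constant baseline leaves the expected gradient unchanged, while the variance drops sharply because the large common offset in the returns no longer scales every sample.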
