Reinforcement Learning and NLP


Reinforcement Learning and NLP
Kapil Thadani (kapil@cs.columbia.edu)

Outline
Model-free RL: Markov decision processes (MDPs), derivative-free optimization, policy gradients, variance reduction, value functions, actor-critic methods
Policy gradients in NLP: non-differentiable metrics, latent structure

Reinforcement learning
"Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain."
Alan Turing, Computing Machinery and Intelligence (1950)

Reinforcement learning
Sequential decision making: learn to model behavior over time; rewards may be stochastic and delayed; trade off exploration vs. exploitation
Generalization of supervised learning: no full access to the function being optimized; stateful environment, where inputs are affected by previous actions; nonstationary samples (no i.i.d. assumption)

Reinforcement learning
[Diagram: the agent sends actions to the environment; the environment returns observations (states) and rewards.]

Reinforcement learning
Model-free
  Policy-based: learn how to take actions in each state
  Value-based: learn the value of actions in each state
Model-based
  Model the environment to predict next states and rewards

Markov Decision Process (MDP)
Discrete-time stochastic control process defined by $\langle S, A, P, R, \gamma \rangle$
$S$: set of states
$A$: set of actions
$P$: transition probability distribution, $P^a_{ss'} = p(s_{t+1} = s' \mid s_t = s, a_t = a)$
$R$: reward function, $R^a_{ss'} = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
$\gamma \in [0, 1)$: discount factor
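The tuple above maps directly onto arrays. The sketch below builds a tiny hypothetical two-state, two-action MDP in numpy and simulates one transition; all numbers and the uniform random policy are illustrative, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the slides).
n_states, n_actions = 2, 2
gamma = 0.9                                   # discount factor in [0, 1)

# P[a, s, s'] = p(s_{t+1} = s' | s_t = s, a_t = a); each row sums to 1.
P = np.array([[[0.8, 0.2],                    # action 0
               [0.1, 0.9]],
              [[0.5, 0.5],                    # action 1
               [0.6, 0.4]]])

# R[a, s] = E[r_t | s_t = s, a_t = a]
R = np.array([[ 1.0, 0.0],
              [-0.5, 2.0]])

# One simulated step under a uniform random policy.
rng = np.random.default_rng(0)
s = 0
a = rng.integers(n_actions)
s_next = rng.choice(n_states, p=P[a, s])
r = R[a, s]
print(s, a, r, s_next)
```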

Markov Decision Process (MDP)
Goal: take actions to maximize expected return over trajectories
Trajectory $\tau$: path through state space up to a horizon, $\tau = s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, \ldots$
Episodic setting: the agent acts until a terminal state is reached
Return $R_t$: cumulative future discounted reward from $s_t$
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
Assuming an infinite horizon: if $\gamma = 1$, $R_t$ can be unbounded; if $0 \le \gamma < 1$, $R_t$ is well-defined and converges (for bounded rewards)
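The recursive form $R_t = r_t + \gamma R_{t+1}$ gives a simple way to compute returns for a finite trajectory; a minimal numpy sketch (the reward sequence is illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return R_t = sum_k gamma^k r_{t+k} for every timestep t of a finite trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a constant reward of 1 gives R_t approaching 1 / (1 - gamma) as the horizon grows.
print(discounted_returns([1.0] * 5, gamma=0.9))   # [4.0951, 3.439, 2.71, 1.9, 1.0]
```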

Markov Decision Process (MDP)
Goal: learn a policy to maximize expected return over trajectories
Policy $\pi(s, a)$ represents action probabilities $p(a \mid s)$ in state $s$
Deterministic policy: $a_t = \pi(s_t)$
Stochastic policy: $a_t \sim \pi(a \mid s_t)$ (borrowing conditional probability notation)
Learn a parameterized policy $\pi_\theta$ with neural network weights $\theta$
The network architecture follows the action space $A$:
Discrete $A$: softmax layer to represent $p(a \mid s_t)$
Continuous $A$: output $\mu$ and diagonal $\sigma$ to sample $a_t \sim \mathcal{N}(\mu, \sigma)$
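As a rough illustration of the two output heads above, the numpy sketch below samples actions from a linear softmax policy (discrete $A$) and from a diagonal Gaussian (continuous $A$); the linear parameterization, weight shapes, and dimensions are assumptions for illustration, standing in for a full neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Discrete action space: a linear "network" followed by a softmax layer.
def sample_discrete_action(state, W):     # W has shape (n_actions, state_dim)
    probs = softmax(W @ state)            # p(a | s_t)
    return rng.choice(len(probs), p=probs), probs

# Continuous action space: the network outputs a mean mu and a diagonal (log) std sigma.
def sample_continuous_action(state, W_mu, log_sigma):
    mu = W_mu @ state
    sigma = np.exp(log_sigma)
    return rng.normal(mu, sigma)          # a_t ~ N(mu, diag(sigma^2))

state = rng.standard_normal(4)            # hypothetical 4-dimensional state
a, probs = sample_discrete_action(state, rng.standard_normal((3, 4)))
a_cont = sample_continuous_action(state, rng.standard_normal((2, 4)), np.zeros(2))
```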

Derivative-free optimization
Objective: maximize $\mathbb{E}_{\tau \sim \theta}[R(\tau)]$
Treat the policy as a black box with parameters $\theta \in \mathbb{R}^d$
Iteratively update $\theta$ to make good returns more likely

Cross-entropy method
Mannor et al. (2003), The Cross Entropy Method for Fast Policy Search
Initialize $\mu_0 \in \mathbb{R}^d$, $\sigma_0 \in \mathbb{R}^d$
At each iteration $i$, draw $L$ samples $\theta_l \sim \mathcal{N}(\mu_i, \sigma_i^2)$
Evaluate each trajectory $\tau_l$ using parameters $\theta_l$
Select the top $\rho\%$ of samples by $R(\tau_l)$ as the elite set
Fit a new diagonal Gaussian to the elite set to obtain $\mu_{i+1}, \sigma_{i+1}$
At convergence, return the final $\mu$
+ No gradients needed; only a forward pass
+ Converges quickly
+ Remarkably effective on many problems, e.g., Tetris
- Needs lots of samples
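A minimal sketch of the procedure summarized above, assuming a black-box evaluate_return(theta) function (which would roll out the policy and return its return) and illustrative hyperparameters:

```python
import numpy as np

def cross_entropy_method(evaluate_return, dim, iters=50, n_samples=100, elite_frac=0.2, seed=0):
    """Minimal CEM sketch: evaluate_return(theta) -> scalar return is treated as a black box."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)                     # initial diagonal Gaussian
    n_elite = int(elite_frac * n_samples)
    for _ in range(iters):
        thetas = rng.normal(mu, sigma, size=(n_samples, dim))   # draw L samples
        returns = np.array([evaluate_return(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]          # top rho% by return
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit diagonal Gaussian
    return mu

# Toy black-box objective (hypothetical): maximize -||theta - 3||^2.
best = cross_entropy_method(lambda th: -np.sum((th - 3.0) ** 2), dim=5)
```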

Evolution strategies
Salimans et al. (2017), Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Initialize $\theta_0 \in \mathbb{R}^d$
At each iteration $i$, sample Gaussian noise $\epsilon_1, \ldots, \epsilon_L \sim \mathcal{N}(0, I)$
Perturb $\theta_i$ with each $\epsilon_l$ to get $\theta_l = \theta_i + \sigma \epsilon_l$
Evaluate each trajectory $\tau_l$ using parameters $\theta_l$ and update
$\theta_{i+1} = \theta_i + \frac{\eta}{\sigma L} \sum_{l=1}^{L} R(\tau_l)\, \epsilon_l$
+ No gradients needed; only a forward pass
+ Easy to parallelize with low communication overhead
+ Competitive on Atari and OpenAI benchmarks
- Requires 3-10x more data
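A minimal sketch of the update above, again assuming a black-box evaluate_return function; the return normalization line is a common practical trick rather than part of the slide's formula, and the hyperparameters are illustrative.

```python
import numpy as np

def evolution_strategies(evaluate_return, dim, iters=200, n_samples=50,
                         sigma=0.1, lr=0.02, seed=0):
    """Minimal ES sketch following the update above (illustrative hyperparameters)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iters):
        eps = rng.standard_normal((n_samples, dim))            # eps_l ~ N(0, I)
        returns = np.array([evaluate_return(theta + sigma * e) for e in eps])
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # common normalization trick
        theta = theta + lr / (n_samples * sigma) * eps.T @ returns     # eta/(sigma*L) * sum R_l eps_l
    return theta

# Toy black-box objective (hypothetical): maximize -||theta - 1||^2.
theta = evolution_strategies(lambda th: -np.sum((th - 1.0) ** 2), dim=10)
```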

Policy gradient
Learn to increase expected return by gradient ascent over $\theta$:
$\theta_{i+1} = \theta_i + \eta\, \nabla_\theta \mathbb{E}_\tau[R(\tau)]$

Policy gradient: REINFORCE
Williams (1992), Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Learn to increase expected return by gradient ascent over $\theta$:
$\theta_{i+1} = \theta_i + \eta\, \nabla_\theta \mathbb{E}_\tau[R(\tau)]$
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \sum_\tau R(\tau)\, \nabla_\theta p(\tau \mid \theta)$
$= \sum_\tau R(\tau)\, \frac{\nabla_\theta p(\tau \mid \theta)}{p(\tau \mid \theta)}\, p(\tau \mid \theta)$
$= \sum_\tau R(\tau)\, \nabla_\theta \log p(\tau \mid \theta)\, p(\tau \mid \theta)$
$= \mathbb{E}_\tau[R(\tau)\, \nabla_\theta \log p(\tau \mid \theta)]$
computed using sample averages: $\approx \frac{1}{L} \sum_{l=1}^{L} R(\tau_l)\, \nabla_\theta \log p(\tau_l \mid \theta)$

Policy gradient: REINFORCE
Williams (1992), Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Learn to increase expected return by gradient ascent over $\theta$:
$\theta_{i+1} = \theta_i + \eta\, \nabla_\theta \mathbb{E}_\tau[R(\tau)] = \theta_i + \eta\, \mathbb{E}_\tau[R(\tau)\, \nabla_\theta \log p(\tau \mid \theta)]$
i.e., increase the log-probability of $\tau$ in proportion to its return $R(\tau)$
+ Unbiased estimator of the gradient
+ Valid even if $R$ is discontinuous or unknown
+ Only needs $p(\tau \mid \theta)$ to be differentiable
- High variance, particularly for long trajectories
- Assigns credit to the whole trajectory rather than to individual actions
- Large number of samples needed
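A small runnable sketch of the sample-average estimator on a hypothetical three-armed bandit, where each "trajectory" is a single action and its reward; the bandit, its rewards, and the stateless linear-softmax policy are illustrative stand-ins, not the slides' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_softmax(theta, a):
    """Gradient of log softmax(theta)[a] for a stateless policy: one-hot(a) - probs."""
    probs = softmax(theta)
    g = -probs
    g[a] += 1.0
    return g

# Toy 3-armed bandit (hypothetical expected rewards): a one-step "trajectory".
true_rewards = np.array([0.0, 1.0, 0.2])
theta, lr, L = np.zeros(3), 0.1, 32

for _ in range(500):
    grad = np.zeros_like(theta)
    for _ in range(L):                               # sample L trajectories
        a = rng.choice(3, p=softmax(theta))
        R = true_rewards[a] + 0.1 * rng.standard_normal()
        grad += R * grad_log_softmax(theta, a)       # R(tau) * grad log p(tau | theta)
    theta += lr * grad / L                           # gradient ascent on expected return
print(softmax(theta))                                # mass concentrates on the best arm
```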

Variance reduction: baseline
If $R(\tau) \ge 0$ for all $\tau$, the estimator always modifies the density, and $\theta_i$ doesn't stabilize with a fixed $\eta$
Subtract a baseline from the reward:
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \nabla_\theta \mathbb{E}_\tau[R(\tau) - b] = \mathbb{E}_\tau[(R(\tau) - b)\, \nabla_\theta \log p(\tau \mid \theta)]$
since $\mathbb{E}_\tau[b\, \nabla_\theta \log p(\tau \mid \theta)] = \sum_\tau b\, \nabla_\theta p(\tau \mid \theta) = b\, \nabla_\theta \sum_\tau p(\tau \mid \theta) = b\, \nabla_\theta 1 = 0$
+ Reduces variance
+ Estimator remains unbiased
Using an estimate of $\mathbb{E}[R(\tau)]$ for $b$ is a near-optimal choice
i.e., increase the log-probability of $\tau$ in proportion to how much its return $R(\tau)$ is better than expected

Variance reduction
$\nabla_\theta \log p(\tau \mid \theta) = \nabla_\theta \log \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P^{a_t}_{s_t s_{t+1}} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
+ No need to model the environment $P_{s_t s_{t+1}}$ (hence model-free)
Rewards are distributed over the trajectory:
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\left(\sum_{t=0}^{T-1} r_t\right) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
Instead, consider an estimator for the reward at timestep $t$:
$\nabla_\theta \mathbb{E}_\tau[r_t] = \mathbb{E}\left[r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\right]$

Variance reduction
Sum over timesteps for the full return:
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\right] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\right]$
Can add a baseline for each state:
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$
i.e., increase the log-probability of action $a_t$ in proportion to how much the future return $R_t(\tau) = \sum_{t'=t}^{T-1} r_{t'}$ is better than expected
+ Ignores prior rewards $r_0, \ldots, r_{t-1}$ when evaluating action $a_t$ at $s_t$
+ Estimator remains unbiased as long as $b$ doesn't depend on $a_t$
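The reward-to-go term $\sum_{t' \ge t} r_{t'}$ and the per-state baseline can be computed as below; the constant average baseline is a purely illustrative stand-in for a learned $b(s_t)$, and the reward sequence is made up.

```python
import numpy as np

def reward_to_go(rewards):
    """R_t(tau) = sum_{t' >= t} r_{t'}: each action is credited only with future rewards."""
    return np.cumsum(rewards[::-1])[::-1]

# Per-timestep REINFORCE weights with a baseline subtracted
# (a constant average baseline here, just for illustration).
rewards = np.array([0.0, 1.0, 0.0, 2.0])
baseline = reward_to_go(rewards).mean()
weights = reward_to_go(rewards) - baseline
# Gradient estimate: sum_t weights[t] * grad log pi_theta(a_t | s_t)
print(reward_to_go(rewards), weights)
```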

Value functions
State-value function $V^\pi(s)$: expected value of being in state $s$ and following policy $\pi$
$V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s]$
State-action value function $Q^\pi(s, a)$: expected value of taking action $a$ from $s$ and then following $\pi$
$Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a]$
Advantage function $A^\pi(s, a)$: expected value of taking action $a$ from $s$ instead of following $\pi$
$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$

Value functions
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\right]$ + baseline term
$= \mathbb{E}_{s_0 \ldots a_t}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}_{r_t \ldots s_T}\left[\sum_{t'=t}^{T-1} r_{t'}\right]\right]$, where the inner expectation is $Q^\pi(s_t, a_t)$
Substituting and reintroducing the baseline:
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(Q^\pi(s_t, a_t) - b(s_t)\big)\right]$
Defining $b(s_t) = V^\pi(s_t)$ is near-optimal (Greensmith et al., 2004):
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$

Value functions
$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$
Want $\mathbb{E}_\tau[A^\pi] = 0$ to keep variance low, i.e., positive advantage for good actions, negative for bad actions
Don't know $A^\pi$; can use an advantage estimator $\hat{A}$ such as
$\hat{A}(s_t) = r_t + r_{t+1} + r_{t+2} + r_{t+3} + \cdots - b(s_t)$
+ Unbiased estimator
- High variance from a single-sample estimate
- Confounds the effects of actions $a_t, a_{t+1}, a_{t+2}, \ldots$

Variance reduction
Use the discounted return when calculating the advantage:
$\hat{A}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots - b(s_t)$
$\gamma < 1$ discounts the effect of actions that are far in the future
To keep $\mathbb{E}_\tau[A^\pi] = 0$, also use the discounted return when fitting the baseline to $V^\pi(s_t)$
+ Emphasizes near-term rewards when evaluating actions
+ Lowers variance
- Biased estimator

Variance reduction
Use $V^\pi$ to estimate future rewards:
$\hat{A}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots - b(s_t)$
$= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V^\pi(s_{t+3}) - b(s_t)$
$= r_t + \gamma r_{t+1} + \gamma^2 V^\pi(s_{t+2}) - b(s_t)$
$= r_t + \gamma V^\pi(s_{t+1}) - b(s_t)$
Use an estimator of the discounted $V^\pi$ for both the future rewards and the baseline:
$\hat{A}(s_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$
+ Explicit bias-variance tradeoff, i.e., fewer reward terms lower the variance of the estimator but increase its bias
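A minimal sketch of the one-step estimator $\hat{A}(s_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$, assuming value estimates are already available as an array; the rewards and values below are made-up numbers.

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """A_hat(s_t) = r_t + gamma * V_hat(s_{t+1}) - V_hat(s_t).
    `values` holds V_hat(s_0 .. s_T); the entry for a terminal state should be 0."""
    rewards, values = np.asarray(rewards), np.asarray(values)
    return rewards + gamma * values[1:] - values[:-1]

# Hypothetical 4-step episode: `values` has length T+1 (bootstrap value of the final state).
rewards = [1.0, 0.0, 0.0, 1.0]
values  = [0.5, 0.4, 0.3, 0.8, 0.0]
print(td_advantages(rewards, values))
```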

Actor-critic methods
$\hat{A}(s_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$
Actor: policy $\pi_\theta(a_t \mid s_t)$
Critic: value function $\hat{V}(s_t)$
Simplified algorithm:
Evaluate policy $\pi_{\theta_i}$ to collect samples $\langle s_t, R_t \rangle$
Fit $\hat{V}$ by minimizing $\sum_n \|\hat{V}(s_n) - R_n\|^2$
Update the policy parameters using $\hat{V}$:
$\theta_{i+1} = \theta_i + \eta\, \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\right)\right]$
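A sketch of one pass of the simplified algorithm with a linear actor and a linear critic; the linear parameterization, shapes, learning rates, and zero terminal value are all assumptions made for illustration, not the slides' architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_update(states, actions, rewards, W_actor, w_critic,
                        gamma=0.99, lr_actor=1e-2, lr_critic=1e-1):
    """One pass over a collected episode. states: (T, d) visited states;
    actions, rewards: length T; W_actor: (n_actions, d); w_critic: (d,)."""
    T = len(rewards)
    # Monte Carlo returns R_t used as regression targets for the critic.
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Critic: one gradient step on sum_n ||V_hat(s_n) - R_n||^2.
    values = states @ w_critic
    w_critic -= lr_critic * (states.T @ (values - returns)) / T
    # Actor: policy gradient with advantage r_t + gamma * V_hat(s_{t+1}) - V_hat(s_t).
    values = states @ w_critic
    for t in range(T):
        v_next = values[t + 1] if t + 1 < T else 0.0        # bootstrap; 0 at episode end
        advantage = rewards[t] + gamma * v_next - values[t]
        probs = softmax(W_actor @ states[t])
        grad_log = -np.outer(probs, states[t])               # d log pi(a_t | s_t) / d W_actor
        grad_log[actions[t]] += states[t]
        W_actor += lr_actor * advantage * grad_log
    return W_actor, w_critic
```

In use, this update would sit inside an outer loop that repeatedly samples fresh episodes from the current policy $\pi_{\theta_i}$ and then calls it on each collected episode.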

Resources
Reinforcement Learning: An Introduction (Sutton & Barto, 2017): http://incompleteideas.net/book/bookdraft2017nov5.pdf
OpenAI Spinning Up (code, environments, papers): https://spinningup.openai.com/
Berkeley RL course materials: http://rail.eecs.berkeley.edu/deeprlcourse/

Policy gradients in NLP
Optimization over non-differentiable metrics
  e.g., text generation metrics like BLEU, ROUGE, CIDEr, etc.
  Learned scores, e.g., neural teachers
End-to-end training of sub-models for latent structure
  Sample from a policy over latent variables
  Back-propagate the loss to the policy network

Non-differentiable metrics
Rennie et al. (2016), Self-critical Sequence Training for Image Captioning
Optimize against CIDEr, BLEU, ROUGE, etc. using REINFORCE
Use the score of the greedy decoding of the output words as a baseline
i.e., encourage exploration of new words that improve rewards
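A small sketch of the self-critical weighting described above: the greedy decode's metric score acts as the baseline for the sampled sequence. The metric scores and token log-probabilities below are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def scst_gradient_weight(sampled_score, greedy_score):
    """Self-critical baseline: the greedy decode's metric score (e.g. CIDEr) is the baseline,
    so the REINFORCE weight is how much the sampled output beats greedy decoding."""
    return sampled_score - greedy_score

# Hypothetical sentence-level metric scores (CIDEr, BLEU, ...).
sampled_score, greedy_score = 0.82, 0.75
weight = scst_gradient_weight(sampled_score, greedy_score)

# REINFORCE-style loss on the sampled sequence's token log-probs (hypothetical values):
sampled_token_logprobs = np.array([-1.2, -0.4, -2.1, -0.7])
loss = -weight * sampled_token_logprobs.sum()     # minimizing raises logprob when weight > 0
```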

Non-differentiable metrics
Bosselut et al. (2018), Discourse-Aware Neural Rewards for Coherent Text Generation
Train a teacher network to score how well-ordered a text is
Incorporate the teacher scores as a reward in SCST (Rennie et al., 2016) to encourage generation of coherent text

Latent structure
Andreas et al. (2016), Learning to Compose Neural Networks for Question Answering
Dynamically assemble a network for each question from modules
Sample the module layout from a policy and backpropagate the loss for answers to the modules with REINFORCE

Latent structure
Lei et al. (2016), Rationalizing Neural Predictions
Define a network to propose sequences of words from the input text as rationales for aspect-based sentiment analysis
Sample rationales during inference and convey the squared error on the task to the rationale generator with REINFORCE
