David Silver, Google DeepMind

1 Tutorial: Deep Reinforcement Learning David Silver, Google DeepMind

2 Outline
- Introduction to Deep Learning
- Introduction to Reinforcement Learning
- Value-Based Deep RL
- Policy-Based Deep RL
- Model-Based Deep RL

3 Reinforcement Learning in a nutshell
RL is a general-purpose framework for decision-making
- RL is for an agent with the capacity to act
- Each action influences the agent's future state
- Success is measured by a scalar reward signal
- Goal: select actions to maximise future reward

4 Deep Learning in a nutshell
DL is a general-purpose framework for representation learning
- Given an objective
- Learn representation that is required to achieve objective
- Directly from raw inputs
- Using minimal domain knowledge

5 Deep Reinforcement Learning: AI = RL + DL
We seek a single agent which can solve any human-level task
- RL defines the objective
- DL gives the mechanism
- RL + DL = general intelligence

6 Examples of Deep RL
- Play games: Atari, poker, Go, ...
- Explore worlds: 3D worlds, Labyrinth, ...
- Control physical systems: manipulate, walk, swim, ...
- Interact with users: recommend, optimise, personalise, ...

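8 Deep Representations
- A deep representation is a composition of many functions
$$x \to h_1 \to \dots \to h_n \to y \to l$$
  where each $h_k$ has its own weights $w_k$
- Its gradient can be backpropagated by the chain rule
$$\frac{\partial l}{\partial h_k} = \frac{\partial h_{k+1}}{\partial h_k} \frac{\partial l}{\partial h_{k+1}}, \qquad \frac{\partial l}{\partial w_k} = \frac{\partial h_k}{\partial w_k} \frac{\partial l}{\partial h_k}$$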
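The chain rule above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the slides: each layer is a scalar `tanh(w_k * h)`, the last layer's output plays the role of $y$, and the loss is a squared error against a made-up target. The analytic gradients are compared with finite differences.

```python
import math

def forward(x, weights):
    """Return [x, h_1, ..., h_n] where h_k = tanh(w_k * h_{k-1})."""
    hs = [x]
    for w in weights:
        hs.append(math.tanh(w * hs[-1]))
    return hs

def loss(x, weights, target):
    """Squared error between the final output and the target."""
    return (forward(x, weights)[-1] - target) ** 2

def backward(x, weights, target):
    """Backpropagate the loss and return dl/dw_k for every layer k."""
    hs = forward(x, weights)
    grad_h = 2.0 * (hs[-1] - target)            # dl/dh_n
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        local = 1.0 - hs[k + 1] ** 2            # tanh'(z) = 1 - tanh(z)^2
        grads[k] = grad_h * local * hs[k]       # dl/dw for this layer
        grad_h = grad_h * local * weights[k]    # gradient passed to the layer below
    return grads

weights = [0.5, -1.2, 0.8]                      # illustrative values
analytic = backward(0.7, weights, target=0.3)

# Finite-difference check of each partial derivative.
eps = 1e-6
numeric = []
for k in range(len(weights)):
    bumped = list(weights)
    bumped[k] += eps
    numeric.append((loss(0.7, bumped, 0.3) - loss(0.7, weights, 0.3)) / eps)
```

The two lists `analytic` and `numeric` should agree to several decimal places.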

9 Deep Neural Network
A deep neural network is typically composed of:
- Linear transformations
$$h_{k+1} = W h_k$$
- Non-linear activation functions
$$h_{k+2} = f(h_{k+1})$$
- A loss function on the output, e.g.
  - Mean-squared error $l = \|y^* - y\|^2$
  - Log likelihood $l = \log \mathbb{P}[y^*]$

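10 Training Neural Networks by Stochastic Gradient Descent
- Sample gradient of expected loss $L(w) = \mathbb{E}[l]$
$$\frac{\partial l}{\partial w} \sim \mathbb{E}\left[\frac{\partial l}{\partial w}\right] = \frac{\partial L(w)}{\partial w}$$
- Adjust $w$ down the sampled gradient
$$\Delta w \propto -\frac{\partial l}{\partial w}$$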
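As a minimal sketch of the update, the following fits a one-parameter linear model by sampling one example at a time. The data set, learning rate and step count are assumptions chosen for illustration, not values from the tutorial.

```python
import random

def sgd_linear(samples, lr=0.05, steps=2000, seed=0):
    """Fit y = w * x by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y_star = rng.choice(samples)        # sample one training example
        grad = 2.0 * (w * x - y_star) * x      # dl/dw for l = (w x - y*)^2
        w -= lr * grad                         # step down the sampled gradient
    return w

# Noise-free data generated with a true weight of 3 (an illustrative choice).
samples = [(k / 10.0, 3.0 * k / 10.0) for k in range(-10, 11)]
w_fit = sgd_linear(samples)
```

Each sampled gradient is a noisy estimate of the gradient of the expected loss, so `w_fit` drifts towards the true weight.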

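11 Weight Sharing
- Recurrent neural network shares weights between time-steps
  [Figure: unrolled recurrent network; inputs $x_t, x_{t+1}$ feed hidden states $h_t, h_{t+1}$ through the same weights $w$, producing outputs $y_t, y_{t+1}$]
- Convolutional neural network shares weights between local regions
  [Figure: the same weights $w_1, w_2$ applied at each local region of the input $x$ to produce $h_1, h_2$]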

13 Many Faces of Reinforcement Learning
[Figure: reinforcement learning at the intersection of several fields. Disciplines: Computer Science, Engineering, Mathematics, Economics, Psychology, Neuroscience. Related areas: Machine Learning, Optimal Control, Operations Research, Rationality/Game Theory, Classical/Operant Conditioning, Reward System.]

14 Agent and Environment
[Figure: agent-environment loop with observation $o_t$, reward $r_t$ and action $a_t$]
- At each step $t$ the agent:
  - Executes action $a_t$
  - Receives observation $o_t$
  - Receives scalar reward $r_t$
- The environment:
  - Receives action $a_t$
  - Emits observation $o_{t+1}$
  - Emits scalar reward $r_{t+1}$

15 State
- Experience is a sequence of observations, actions, rewards
$$o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t$$
- The state is a summary of experience
$$s_t = f(o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t)$$
- In a fully observed environment
$$s_t = f(o_t)$$

16 Major Components of an RL Agent
- An RL agent may include one or more of these components:
  - Policy: agent's behaviour function
  - Value function: how good is each state and/or action
  - Model: agent's representation of the environment

17 Policy
- A policy is the agent's behaviour
- It is a map from state to action:
  - Deterministic policy: $a = \pi(s)$
  - Stochastic policy: $\pi(a|s) = \mathbb{P}[a|s]$

19 Value Function
- A value function is a prediction of future reward
  - "How much reward will I get from action $a$ in state $s$?"
- Q-value function gives expected total reward
  - from state $s$ and action $a$
  - under policy $\pi$
  - with discount factor $\gamma$
$$Q^\pi(s, a) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a\right]$$
- Value functions decompose into a Bellman equation
$$Q^\pi(s, a) = \mathbb{E}_{s', a'}\left[r + \gamma Q^\pi(s', a') \mid s, a\right]$$
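A small worked example may help with the discounted sum and its recursive decomposition. The reward sequence and discount factor below are made up for illustration. The check at the end is the single-trajectory version of the Bellman equation: the return from step $t$ equals the first reward plus $\gamma$ times the return from step $t+1$.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total   # fold the sum from the last reward backwards
    return total

rewards = [1.0, 0.0, 2.0, 5.0]      # illustrative rewards r_{t+1}, r_{t+2}, ...
gamma = 0.9

g0 = discounted_return(rewards, gamma)        # return from the first step
g1 = discounted_return(rewards[1:], gamma)    # return from the next step
bellman_gap = g0 - (rewards[0] + gamma * g1)  # zero up to rounding
```

Here `g0` works out to $1 + 0.81 \cdot 2 + 0.729 \cdot 5 = 6.265$.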

23 Optimal Value Functions
- An optimal value function is the maximum achievable value
$$Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$$
- Once we have $Q^*$ we can act optimally,
$$\pi^*(s) = \operatorname*{argmax}_a Q^*(s, a)$$
- Optimal value maximises over all decisions. Informally:
$$Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \dots = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$
- Formally, optimal values decompose into a Bellman equation
$$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$$
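To make the Bellman optimality equation concrete, here is a sketch that applies it repeatedly as an update rule on a tiny two-state problem. The problem itself (states, actions, rewards) is hypothetical and deterministic, so the expectation over $s'$ collapses to a single next state.

```python
def q_value_iteration(model, gamma, sweeps=500):
    """Iterate Q(s,a) <- r + gamma * max_a' Q(s',a') over a deterministic model."""
    q = {sa: 0.0 for sa in model}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), (reward, s_next) in model.items():
            best_next = max(v for (s2, _), v in q.items() if s2 == s_next)
            new_q[(s, a)] = reward + gamma * best_next
        q = new_q
    return q

def greedy_policy(q):
    """pi*(s) = argmax_a Q(s, a)."""
    policy = {}
    for (s, a), value in q.items():
        if s not in policy or value > q[(s, policy[s])]:
            policy[s] = a
    return policy

# Hypothetical model: (state, action) -> (reward, next state).
model = {
    (0, "stay"): (1.0, 0),
    (0, "go"): (0.0, 1),
    (1, "stay"): (2.0, 1),
    (1, "go"): (0.0, 0),
}
q_star = q_value_iteration(model, gamma=0.9)
pi_star = greedy_policy(q_star)
```

With $\gamma = 0.9$, staying in state 1 forever is worth $2 / (1 - 0.9) = 20$, so the greedy policy moves from state 0 to state 1 (value $0.9 \cdot 20 = 18$) and then stays.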

24 Value Function Demo

26 Model
[Figure: model in place of the environment, with observation $o_t$, reward $r_t$ and action $a_t$]
- Model is learnt from experience
- Acts as proxy for environment
- Planner interacts with model
  - e.g. using lookahead search

27 Approaches To Reinforcement Learning
Value-based RL
- Estimate the optimal value function $Q^*(s, a)$
- This is the maximum value achievable under any policy
Policy-based RL
- Search directly for the optimal policy $\pi^*$
- This is the policy achieving maximum future reward
Model-based RL
- Build a model of the environment
- Plan (e.g. by lookahead) using model

28 Deep Reinforcement Learning
- Use deep neural networks to represent
  - Value function
  - Policy
  - Model
- Optimise loss function by stochastic gradient descent

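30 Q-Networks
Represent value function by Q-network with weights $w$
$$Q(s, a, w) \approx Q^*(s, a)$$
[Figure: two architectures. One network takes $s$ and $a$ as input and outputs $Q(s, a, w)$; the other takes $s$ alone and outputs $Q(s, a_1, w), \dots, Q(s, a_m, w)$.]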

33 Q-Learning
- Optimal Q-values should obey Bellman equation
$$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$$
- Treat right-hand side $r + \gamma \max_{a'} Q(s', a', w)$ as a target
- Minimise MSE loss by stochastic gradient descent
$$l = \left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right)^2$$
- Converges to $Q^*$ using table lookup representation
- But diverges using neural networks due to:
  - Correlations between samples
  - Non-stationary targets
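The table lookup case can be sketched as follows. With one table entry per state-action pair, a gradient step on the squared error (treating the target as a constant) reduces to moving the entry towards the target by a step size `alpha`. The two-state problem, the uniformly random behaviour policy and the hyperparameters are assumptions for illustration.

```python
import random

# Hypothetical deterministic problem: (state, action) -> (reward, next state).
MODEL = {
    (0, "stay"): (1.0, 0),
    (0, "go"): (0.0, 1),
    (1, "stay"): (2.0, 1),
    (1, "go"): (0.0, 0),
}
ACTIONS = ("stay", "go")

def q_learning(model, gamma=0.9, alpha=0.5, steps=20000, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in model}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)
        r, s_next = model[(s, a)]
        # Right-hand side of the Bellman equation, used as the target.
        target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        # Step the table entry towards the target.
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s_next
    return q

q_table = q_learning(MODEL)
```

For this problem the optimal values are $Q^*(1, \text{stay}) = 20$ and $Q^*(0, \text{go}) = 18$, and the learnt table should approach them.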

34 Deep Q-Networks (DQN): Experience Replay
To remove correlations, build data-set from agent's own experience
- Store each transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the data-set: $(s_1, a_1, r_2, s_2)$, $(s_2, a_2, r_3, s_3)$, $(s_3, a_3, r_4, s_4)$, ...
- Sample experiences $(s, a, r, s')$ from data-set and apply update
$$l = \left(r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w)\right)^2$$
To deal with non-stationarity, target parameters $w^-$ are held fixed
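The two ideas on this slide, a replay data-set and a frozen set of target parameters, can be sketched without any neural network. In the sketch below a lookup table stands in for the Q-network, `q_target` plays the role of $w^-$, and the class name `ReplayBuffer`, the buffer capacity and the `sync_every` period are all hypothetical choices rather than details of DQN itself.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)   # oldest transitions are dropped first

    def add(self, transition):
        self.data.append(transition)

    def sample(self, rng):
        return rng.choice(self.data)         # uniform sampling breaks temporal order

def train(model, actions, gamma=0.9, alpha=0.5, steps=20000, sync_every=100, seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=1000)
    q = {sa: 0.0 for sa in model}            # online values (role of w)
    q_target = dict(q)                       # frozen copy (role of w^-)
    s = 0
    for step in range(steps):
        a = rng.choice(actions)              # random behaviour, for simplicity
        r, s_next = model[(s, a)]
        buffer.add((s, a, r, s_next))
        s = s_next
        bs, ba, br, bs_next = buffer.sample(rng)
        target = br + gamma * max(q_target[(bs_next, b)] for b in actions)
        q[(bs, ba)] += alpha * (target - q[(bs, ba)])
        if (step + 1) % sync_every == 0:
            q_target = dict(q)               # refresh the target parameters
    return q, buffer

# Hypothetical deterministic problem: (state, action) -> (reward, next state).
model = {(0, "stay"): (1.0, 0), (0, "go"): (0.0, 1),
         (1, "stay"): (2.0, 1), (1, "go"): (0.0, 0)}
q_replay, replay = train(model, ("stay", "go"))
```

Between synchronisations the target does not move, so each update chases a stationary value.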

35 Deep Reinforcement Learning in Atari
[Figure: agent-environment loop for Atari, with state $s_t$, action $a_t$ and reward $r_t$]

36 DQN in Atari
- End-to-end learning of values $Q(s, a)$ from pixels $s$
- Input state $s$ is stack of raw pixels from last 4 frames
- Output is $Q(s, a)$ for 18 joystick/button positions
- Reward is change in score for that step
Network architecture and hyperparameters fixed across all games

37 DQN Results in Atari

38 DQN Atari Demo
- DQN paper: …com/articles/nature14236
- DQN source code: sites.google.com/a/deepmind.com/dqn/
[Figure: cover of Nature, 26 February, Vol. 518, No. 7540, with the headline "Self-taught AI software attains human-level performance in video games"]

42 Improvements since Nature DQN
- Double DQN: Remove upward bias caused by $\max_a Q(s, a, w)$
  - Current Q-network $w$ is used to select actions
  - Older Q-network $w^-$ is used to evaluate actions
$$l = \left(r + \gamma Q\big(s', \operatorname*{argmax}_{a'} Q(s', a', w), w^-\big) - Q(s, a, w)\right)^2$$
- Prioritised replay: Weight experience according to surprise
  - Store experience in priority queue according to DQN error
$$\left| r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right|$$
- Duelling network: Split Q-network into two channels
  - Action-independent value function $V(s, v)$
  - Action-dependent advantage function $A(s, a, w)$
$$Q(s, a) = V(s, v) + A(s, a, w)$$
Combined algorithm: 3x mean Atari score vs Nature DQN
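The three formulas on this slide differ only in how a few numbers are combined, which the following sketch shows on hand-picked values. The function names and the example Q-values are hypothetical. The duelling combination follows the plain sum written on the slide.

```python
def dqn_target(r, gamma, q_next_target):
    """Nature DQN target: the target network both selects and evaluates."""
    return r + gamma * max(q_next_target)

def double_dqn_target(r, gamma, q_next_online, q_next_target):
    """Double DQN target: online network selects, target network evaluates."""
    best = max(range(len(q_next_online)), key=lambda i: q_next_online[i])
    return r + gamma * q_next_target[best]

def td_priority(r, gamma, q_next_target, q_sa):
    """Replay priority: magnitude of the DQN error."""
    return abs(r + gamma * max(q_next_target) - q_sa)

def duelling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) for every action."""
    return [value + adv for adv in advantages]

# Illustrative Q-values for the next state s' under the two parameter sets.
q_online = [1.0, 3.0, 2.0]    # Q(s', ., w)
q_frozen = [2.5, 1.5, 2.0]    # Q(s', ., w^-)

single = dqn_target(1.0, 0.9, q_frozen)                    # uses max = 2.5
double = double_dqn_target(1.0, 0.9, q_online, q_frozen)   # uses action 1 -> 1.5
priority = td_priority(1.0, 0.9, q_frozen, q_sa=2.0)
q_values = duelling_q(1.0, [0.5, -0.5])
```

In this example the double target (2.35) is lower than the single-network target (3.25), because the action picked by the online network is not the one the older network happens to overrate.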

43 Gorila (General Reinforcement Learning Architecture)
- 10x faster than Nature DQN on 38 out of 49 Atari games
- Applied to recommender systems within Google

44 Asynchronous Reinforcement Learning
- Exploits multithreading of standard CPU
- Execute many instances of agent in parallel
- Network parameters shared between threads
- Parallelism decorrelates data
  - Viable alternative to experience replay
- Similar speedup to Gorila, on a single machine!

46 Deep Policy Networks
- Represent policy by deep network with weights $u$
$$a = \pi(a|s, u) \quad \text{or} \quad a = \pi(s, u)$$
- Define objective function as total discounted reward
$$L(u) = \mathbb{E}\left[r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \mid \pi(\cdot, u)\right]$$
- Optimise objective end-to-end by SGD
  - i.e. Adjust policy parameters $u$ to achieve more reward

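48 Policy Gradients
How to make high-value actions more likely:
- The gradient of a stochastic policy $\pi(a|s, u)$ is given by
$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial \log \pi(a|s, u)}{\partial u} Q^\pi(s, a)\right]$$
- The gradient of a deterministic policy $a = \pi(s)$ is given by
$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial Q^\pi(s, a)}{\partial a} \frac{\partial a}{\partial u}\right]$$
  - if $a$ is continuous and $Q$ is differentiable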
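A minimal sketch of the stochastic policy gradient, assuming a softmax policy over two actions in a single-state problem. The action values, learning rate and step count are made up, and the known action values stand in for $Q^\pi(s, a)$. For a softmax policy, $\partial \log \pi(a) / \partial u_i$ is $1 - \pi(i)$ for the chosen action and $-\pi(i)$ otherwise.

```python
import math
import random

def softmax(prefs):
    """Turn preferences u into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def train_bandit(action_values, lr=0.1, steps=5000, seed=0):
    """Follow sampled policy gradients: grad log pi(a) * Q(a)."""
    rng = random.Random(seed)
    u = [0.0] * len(action_values)
    for _ in range(steps):
        probs = softmax(u)
        a = rng.choices(range(len(u)), weights=probs)[0]   # sample a ~ pi(.|u)
        q = action_values[a]                               # value of the sampled action
        for i in range(len(u)):
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            u[i] += lr * grad_log * q                      # ascend the objective
    return softmax(u)

final_probs = train_bandit([0.2, 1.0])   # hypothetical values: action 1 is better
```

After training, most of the probability mass should sit on the higher-value action.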

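49 Actor-Critic Algorithm
- Estimate value function $Q(s, a, w) \approx Q^\pi(s, a)$
- Update policy parameters $u$ by stochastic gradient ascent
$$\frac{\partial l}{\partial u} = \frac{\partial \log \pi(a|s, u)}{\partial u} Q(s, a, w)$$
  or
$$\frac{\partial l}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a} \frac{\partial a}{\partial u}$$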

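52 Asynchronous Advantage Actor-Critic (A3C)
- Estimate state-value function
$$V(s, v) \approx \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \dots \mid s\right]$$
- Q-value estimated by an n-step sample
$$q_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}, v)$$
- Actor is updated towards target
$$\frac{\partial l_u}{\partial u} = \frac{\partial \log \pi(a_t|s_t, u)}{\partial u} \big(q_t - V(s_t, v)\big)$$
- Critic is updated to minimise MSE w.r.t. target
$$l_v = \big(q_t - V(s_t, v)\big)^2$$
- 4x mean Atari score vs Nature DQN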
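The n-step target and the two quantities built from it can be computed by hand. The sketch below uses made-up rewards and value estimates and covers only the arithmetic, not the asynchronous training loop.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """q_t = r_{t+1} + gamma r_{t+2} + ... + gamma^(n-1) r_{t+n} + gamma^n V(s_{t+n})."""
    q = bootstrap_value
    for r in reversed(rewards):
        q = r + gamma * q
    return q

# Illustrative numbers: n = 3 rewards, then bootstrap from the critic.
rewards = [1.0, 0.0, 2.0]
v_bootstrap = 10.0   # V(s_{t+3}, v)
v_current = 2.0      # V(s_t, v)

q_t = n_step_return(rewards, v_bootstrap, gamma=0.5)
advantage = q_t - v_current   # scales the actor's log-probability gradient
critic_loss = advantage ** 2  # l_v, minimised with respect to v
```

With $\gamma = 0.5$ this gives $q_t = 1 + 0 + 0.25 \cdot 2 + 0.125 \cdot 10 = 2.75$, an advantage of $0.75$ and a critic loss of $0.5625$.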

53 Deep Reinforcement Learning in Labyrinth

54 A3C in Labyrinth
[Figure: recurrent network unrolled over steps $t-1$, $t$, $t+1$; at each step the observation $o_t$ updates the state $s_t$, which outputs $\pi(a|s_t)$ and $V(s_t)$]
- End-to-end learning of softmax policy $\pi(a|s_t)$ from pixels
- Observations $o_t$ are raw pixels from current frame
- State $s_t = f(o_1, \dots, o_t)$ is a recurrent neural network (LSTM)
- Outputs both value $V(s)$ and softmax over actions $\pi(a|s)$
- Task is to collect apples (+1 reward) and escape (+10 reward)

55 A3C Labyrinth Demo
- Demo: …
- Labyrinth source code (coming soon): sites.google.com/a/deepmind.com/labyrinth/

56 Deep Reinforcement Learning with Continuous Actions
How can we deal with high-dimensional continuous action spaces?
- Can't easily compute $\max_a Q(s, a)$
  - Actor-critic algorithms learn without max
- Q-values are differentiable w.r.t. $a$
  - Deterministic policy gradients exploit knowledge of $\frac{\partial Q}{\partial a}$

57 Deep DPG
DPG is the continuous analogue of DQN
- Experience replay: build data-set from agent's experience
- Critic estimates value of current policy by DQN
$$l_w = \left(r + \gamma Q(s', \pi(s', u^-), w^-) - Q(s, a, w)\right)^2$$
  To deal with non-stationarity, targets $u^-, w^-$ are held fixed
- Actor updates policy in direction that improves $Q$
$$\frac{\partial l_u}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a} \frac{\partial a}{\partial u}$$
- In other words critic provides loss function for actor
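The actor update can be illustrated in one dimension. The sketch assumes a linear policy $a = u \cdot s$ and a fixed, hand-written critic $Q(s, a) = -(a - 2s)^2$ whose best action is $a = 2s$. Both are hypothetical stand-ins for the networks, and the critic is not learnt here.

```python
import random

def critic(s, a):
    """Assumed critic: highest value when a = 2 * s."""
    return -(a - 2.0 * s) ** 2

def dq_da(s, a):
    """Derivative of the assumed critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)

def train_actor(lr=0.1, steps=2000, seed=0):
    """Deterministic policy gradient for the linear policy a = u * s."""
    rng = random.Random(seed)
    u = 0.0
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)      # sampled state
        a = u * s                       # deterministic action
        da_du = s                       # derivative of the policy w.r.t. u
        u += lr * dq_da(s, a) * da_du   # move u in the direction that improves Q
    return u

u_fit = train_actor()
value_at_one = critic(1.0, u_fit * 1.0)   # close to the maximum value of 0
```

The actor never sees rewards directly: the critic's gradient with respect to the action is the only learning signal, which is the sense in which the critic provides the loss function for the actor.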

58 DPG in Simulated Physics
- Physics domains are simulated in MuJoCo
- End-to-end learning of control policy from raw pixels $s$
- Input state $s$ is stack of raw pixels from last 4 frames
- Two separate convnets are used for $Q$ and $\pi$
- Policy $\pi$ is adjusted in direction that most improves $Q$
[Figure: actor network outputs $a = \pi(s)$, which feeds the critic network $Q(s, a)$]

59 DPG in Simulated Physics Demo
- Demo: DPG from pixels

60 A3C in Simulated Physics Demo
- Asynchronous RL is viable alternative to experience replay
- Train a hierarchical, recurrent locomotion controller
- Retrain controller on more challenging tasks

61 Fictitious Self-Play (FSP)
Can deep RL find Nash equilibria in multi-agent games?
- Q-network learns "best response" to opponent policies
  - By applying DQN with experience replay
  - c.f. fictitious play
- Policy network $\pi(a|s, u)$ learns an average of best responses
$$\frac{\partial l}{\partial u} = \frac{\partial \log \pi(a|s, u)}{\partial u}$$
- Actions $a$ sample mix of policy network and best response
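For background, classic fictitious play (the tabular method the slide refers to, not the neural variant) can be sketched on matching pennies. Each player repeatedly plays a best response to the opponent's empirical action frequencies, and the average strategies approach the equilibrium of playing each side half the time. The game, the initial counts and the iteration count are illustrative assumptions.

```python
def best_response(payoff_rows, opponent_counts):
    """Index of the action with the highest payoff against the opponent's empirical play."""
    scores = [sum(p * c for p, c in zip(row, opponent_counts)) for row in payoff_rows]
    return max(range(len(scores)), key=lambda i: scores[i])

def fictitious_play(iterations=20000):
    # Matching pennies: the row player wins on a match, the column player on a mismatch.
    row_payoff = [[1, -1], [-1, 1]]     # indexed [row action][column action]
    col_payoff = [[-1, 1], [1, -1]]     # indexed [column action][row action]
    row_counts, col_counts = [1, 0], [0, 1]   # arbitrary initial plays
    for _ in range(iterations):
        a_row = best_response(row_payoff, col_counts)
        a_col = best_response(col_payoff, row_counts)
        row_counts[a_row] += 1
        col_counts[a_col] += 1
    total = sum(row_counts)
    return [c / total for c in row_counts], [c / total for c in col_counts]

row_avg, col_avg = fictitious_play()
```

Each individual best response is a pure strategy, yet the averages drift towards the mixed equilibrium; this is the role the slide assigns to the policy network.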

62 Neural FSP in Texas Hold'em Poker
- Heads-up limit Texas Hold'em
- NFSP with raw inputs only (no prior knowledge of Poker)
- vs SmooCT (3x medal winner 2015, handcrafted knowledge)
[Figure: performance in mbb/h against iterations, with curves for SmooCT; NFSP, best response strategy; NFSP, greedy-average strategy; NFSP, average strategy]

64 Learning Models of the Environment
- Demo: generative model of Atari
- Challenging to plan due to compounding errors
  - Errors in the transition model compound over the trajectory
  - Planning trajectories differ from executed trajectories
  - At end of long, unusual trajectory, rewards are totally wrong
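Compounding error is easy to see in a toy rollout. The dynamics below are invented for illustration: the true system multiplies the state by 1.10 each step, while the learnt model is assumed to be slightly off at 1.12.

```python
def rollout(step_fn, x0, horizon):
    """Apply a one-step transition function repeatedly and record the trajectory."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step_fn(xs[-1]))
    return xs

def true_step(x):
    return 1.10 * x    # hypothetical true dynamics

def model_step(x):
    return 1.12 * x    # hypothetical learnt model with a small one-step error

real = rollout(true_step, 1.0, 30)
imagined = rollout(model_step, 1.0, 30)
errors = [abs(p - q) for p, q in zip(imagined, real)]
```

The error after one step is about 0.02, but after 30 steps the imagined trajectory is off by more than 10, so rewards evaluated at the end of a long planned trajectory refer to states the agent would never actually reach.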

65 Deep Reinforcement Learning in Go
What if we have a perfect model? e.g. game rules are known
- AlphaGo paper: www.…
- AlphaGo resources: deepmind.com/alphago/
[Figure: cover of Nature, 28 January, Vol. 529, No. 7587, with the headline "ALL SYSTEMS GO: At last a computer program that can beat a champion Go player"]

66 Conclusion
- General, stable and scalable RL is now possible
- Using deep networks to represent value, policy, model
- Successful in Atari, Labyrinth, Physics, Poker, Go
- Using a variety of deep RL paradigms
