CS 188: Artificial Intelligence
1 CS 188: Artificial Intelligence Reinforcement Learning Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein and Pieter Abbeel]
2 Reinforcement Learning
3 Double Bandits
4 Double-Bandit MDP
Actions: Blue, Red. States: Win (W), Lose (L). No discount; 10 time steps; both states have the same value.
[Diagram: Blue always pays $1 (prob 1.0); Red pays $2 with prob 0.75 and $0 otherwise]
5 Offline Planning
Solving MDPs is offline planning:
- You determine all quantities through computation
- You need to know the details of the MDP
- You do not actually play the game!
No discount; 10 time steps; both states have the same value.
Play Red: value 15. Play Blue: value 10 ($1 per step for 10 steps).
6 Let's Play! $2 $2 $0 $2 $2 $2 $2 $0 $0 $0
7 Online Planning
Rules changed! Red's win chance is different.
[Diagram: Red's payoff probabilities are now unknown (??); Blue still pays $1 with prob 1.0]
8 Let's Play! $0 $0 $0 $2 $0 $2 $0 $0 $0 $0
9 What Just Happened?
That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation; you needed to actually act to figure it out.
Important ideas in reinforcement learning that came up:
- Exploration: you have to try unknown actions to get information
- Exploitation: eventually, you have to use what you know
- Regret: even if you learn intelligently, you make mistakes
- Sampling: because of chance, you have to try things repeatedly
- Difficulty: learning can be much harder than solving a known MDP
10 Reinforcement Learning
Still assume a Markov decision process (MDP):
- A set of states s ∈ S
- A set of actions (per state) A
- A model T(s,a,s′)
- A reward function R(s,a,s′)
Still looking for a policy π(s).
New twist: we don't know T or R, i.e., we don't know which states are good or what the actions do. We must actually try actions and states out to learn.
11 Reinforcement Learning
[Diagram: agent-environment loop; the environment sends the agent a state s and a reward r, and the agent sends back actions a]
Basic idea: learn how to maximize expected rewards based on observed samples of transitions.
12 Example: Learning to Walk Initial After Learning [1K Trials] [Kohl and Stone, ICRA 2004]
13 Example: Learning to Walk [Kohl and Stone, ICRA 2004] Initial [Video: AIBO WALK initial]
14 Example: Learning to Walk [Kohl and Stone, ICRA 2004] Finished [Video: AIBO WALK finished]
15 Example: Sidewinding [Andrew Ng] [Video: SNAKE climbstep+sidewinding]
16 Example: Toddler Robot [Tedrake, Zhang and Seung, 2005] [Video: TODDLER 40s]
17 The Crawler! [Demo: Crawler Bot (L10D1)] [You, in Project 3]
18 Video of Demo Crawler Bot
19 DeepMind Atari (Two Minute Lectures)
20 Reinforcement Learning
Still assume a Markov decision process (MDP):
- A set of states s ∈ S
- A set of actions (per state) A
- A model P(s′|s,a)
- A reward function R(s,a,s′)
Still looking for a policy π(s).
New twist: we don't know P or R, i.e., we don't know which states are good or what the actions do. We must actually try actions and explore new states -- to boldly go where no Pacman agent has been before.
21 Offline (MDPs) vs. Online (RL) Offline Solution Online Learning
22 Model-Based Learning
23 Model-Based Learning
Model-based idea: learn an approximate model based on experiences, then solve for values as if the learned model were correct.
Step 1: Learn an empirical MDP model
- Count outcomes s′ for each (s, a)
- Normalize to give an estimate of P^(s′|s,a)
- Discover each R(s,a,s′) when we experience the transition
Step 2: Solve the learned MDP
- For example, use value or policy iteration, as before
24 Example: Model-Based Learning
Input policy π (states A, B, C, D, E). Assume: γ = 1.
Observed episodes (training):
- Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
- Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
- Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
- Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Learned model:
T(s,a,s′) ≈ P^(s′|s,a): P^(C|B, east) = 1.00; P^(D|C, east) = 0.75; P^(A|C, east) = 0.25
R(s,a,s′): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10
(A sketch of Step 1 on these episodes follows.)
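Here is a minimal Python sketch (an illustration, not from the slides) of Step 1: count outcomes for each (s, a) on the four episodes above and normalize.

```python
from collections import Counter, defaultdict

# The four training episodes above, as (s, a, s', r) transitions.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # (s, a) -> Counter over next states s'
rewards = {}                    # (s, a, s') -> observed reward

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Normalize counts into the empirical transition model P^(s' | s, a).
T_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}

print(T_hat[("C", "east")])   # {'D': 0.75, 'A': 0.25}
```

Step 2 would then run value or policy iteration on `T_hat` and `rewards` as if they were the true MDP.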
25 Pros and cons
Pro: makes efficient use of experiences.
Con: may not scale to large state spaces:
- Learns the model one state-action pair at a time (but this is fixable)
- Cannot solve the MDP for very large S
26 Model-Free Learning
27 Basic idea of model-free methods
To approximate expectations with respect to a distribution, you can either:
- Estimate the distribution from samples, then compute an expectation, or
- Estimate the expectation from samples directly
28 Example: Expected Age
Goal: compute the expected age of CS 188 students.
Known P(A): E[A] = Σ_a P(a)·a = 0.35 × 20 + …
Without P(A), instead collect samples [a_1, a_2, …, a_N].
Unknown P(A), model-based: P^(a) = N_a/N, then E[A] ≈ Σ_a P^(a)·a. Why does this work? Because eventually you learn the right model.
Unknown P(A), model-free: E[A] ≈ (1/N) Σ_i a_i. Why does this work? Because samples appear with the right frequencies.
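A small sketch of both estimators on hypothetical age samples (the candidate ages below are made up for illustration):

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical age samples; the true distribution P(A) is unknown to us.
samples = [random.choice([19, 20, 20, 21, 22, 25]) for _ in range(1000)]
N = len(samples)

# Model-based: first estimate P^(a) = N_a / N, then take the expectation.
p_hat = {a: n / N for a, n in Counter(samples).items()}
model_based = sum(p * a for a, p in p_hat.items())

# Model-free: estimate the expectation from the samples directly.
model_free = sum(samples) / N

print(model_based, model_free)   # the two estimates agree here
```

For a plain expectation the two estimates coincide; the distinction matters when the estimated model P^ is reused for something else, such as solving an MDP.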
29 Passive Reinforcement Learning
30 Passive Reinforcement Learning
Simplified task: policy evaluation.
- Input: a fixed policy π(s)
- You don't know the transitions P(s′|s,a)
- You don't know the rewards R(s,a,s′)
- Goal: learn the state values V^π(s)
In this case, the learner is along for the ride: no choice about what actions to take, just do it. This is NOT offline planning! The agent takes actions in the world.
31 Direct utility estimation
Goal: estimate V^π(s), i.e., the expected total discounted reward from s onwards.
Idea: use the actual sum of discounted rewards from s, averaged over multiple trials and visits to s. This is called direct utility estimation.
32 Example: Direct Evaluation
Input policy π (states A, B, C, D, E). Assume: γ = 1.
Observed episodes (training): the same four episodes as in the model-based learning example.
Output values: A = -10, B = +8, C = +4, D = +10, E = -2.
(A sketch of the computation follows.)
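A minimal sketch of direct evaluation on those episodes, averaging the observed return from each state visit onward:

```python
from collections import defaultdict

gamma = 1.0
# The same four episodes as in the model-based example.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

returns = defaultdict(list)
for episode in episodes:
    G = 0.0
    # Walk backwards so that G_t = r_t + gamma * G_{t+1}.
    for s, a, s2, r in reversed(episode):
        G = r + gamma * G
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # {'D': 10.0, 'C': 4.0, 'B': 8.0, 'A': -10.0, 'E': -2.0}
```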
33 Problems with Direct Evaluation
What's good about direct evaluation?
- It's easy to understand
- It doesn't require any knowledge of P(s′|s,a) or R(s,a,s′)
- It converges to the right answer in the limit
What's bad about it?
- Each state must be learned separately (fixable)
- It ignores information about state connections
- So, it takes a long time to learn
Look at the output values: B = +8 but E = -2. If B and E both go to C under this policy, how can their values be different?
34 Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy: each round, replace V with a one-step-lookahead layer over V:
V_{k+1}^π(s) ← Σ_{s′} T(s,π(s),s′) [R(s,π(s),s′) + γ V_k^π(s′)]
This approach fully exploited the connections between the states. Unfortunately, we need T and R to do it!
Key question: how can we do this update to V without knowing T and R? In other words, how do we take a weighted average without knowing the weights?
35 Sample-Based Policy Evaluation?
We want to improve our estimate of V^π by computing these averages.
Idea: take samples of outcomes s′ (by doing the action!) and average.
Almost! But we can't rewind time to get sample after sample from state s.
36 Temporal difference (TD) learning
37 TD as approximate Bellman update
Policy evaluation (version 1) improves the estimate of V by computing the Bellman update, i.e., an expectation over next-state values:
V_{k+1}(s) ← Σ_{s′} P(s′|s,π(s)) [R(s,π(s),s′) + γ V_k(s′)]
Idea 1: use actual samples to estimate the expectation:
sample_1 = R(s,π(s),s′_1) + γ V_k(s′_1)
sample_2 = R(s,π(s),s′_2) + γ V_k(s′_2)
…
sample_N = R(s,π(s),s′_N) + γ V_k(s′_N)
V_{k+1}(s) ← (1/N) Σ_i sample_i
38 TD as approximate Bellman update
Idea 2: update the value of s after each transition (s, a, s′, r):
- Update V^π([3,1]) based on R([3,1],up,[3,2]) and γ V^π([3,2])
- Update V^π([3,2]) based on R([3,2],up,[3,3]) and γ V^π([3,3])
- Update V^π([3,3]) based on R([3,3],right,[4,3]) and γ V^π([4,3])
39 TD as approximate Bellman update Idea 3: Update values by maintaining a running average
40 Running averages
How do you compute the average of 1, 4, 7?
Method 1: add them up and divide by N: sum = 12, average = 12/N = 12/3 = 4.
Method 2: keep a running average μ_n and a running count n:
n=0: μ_0 = 0
n=1: μ_1 = (0·μ_0 + x_1)/1 = (0 + 1)/1 = 1
n=2: μ_2 = (1·μ_1 + x_2)/2 = (1 + 4)/2 = 2.5
n=3: μ_3 = (2·μ_2 + x_3)/3 = (5 + 7)/3 = 4
General formula: μ_n = ((n-1)·μ_{n-1} + x_n)/n = [(n-1)/n]·μ_{n-1} + [1/n]·x_n (a weighted average of the old mean and the new sample)
41 Running averages contd.
What if we use a weighted average with a fixed weight α?
μ_n = (1-α)·μ_{n-1} + α·x_n
n=1: μ_1 = x_1
n=2: μ_2 = (1-α)·μ_1 + α·x_2 = (1-α)·x_1 + α·x_2
n=3: μ_3 = (1-α)·μ_2 + α·x_3 = (1-α)²·x_1 + α(1-α)·x_2 + α·x_3
n=4: μ_4 = (1-α)·μ_3 + α·x_4 = (1-α)³·x_1 + α(1-α)²·x_2 + α(1-α)·x_3 + α·x_4
I.e., exponential forgetting of old values (see the sketch below).
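A short sketch contrasting the two running averages on the 1, 4, 7 example (α = 0.5 is an arbitrary choice):

```python
# Exact running mean vs. fixed-weight (exponential) running average.
def running_means(xs):
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu = ((n - 1) * mu + x) / n        # mu_n = [(n-1)/n] mu_{n-1} + [1/n] x_n
        yield mu

def exponential_means(xs, alpha=0.5):
    mu = xs[0]                             # initialize with the first sample
    yield mu
    for x in xs[1:]:
        mu = (1 - alpha) * mu + alpha * x  # exponential forgetting of old values
        yield mu

print(list(running_means([1, 4, 7])))      # [1.0, 2.5, 4.0]
print(list(exponential_means([1, 4, 7])))  # [1, 2.5, 4.75]
```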
42 TD as approximate Bellman update
Idea 3: update values by maintaining a running average:
sample = R(s,π(s),s′) + γ V^π(s′)
V^π(s) ← (1-α)·V^π(s) + α·sample
Equivalently: V^π(s) ← V^π(s) + α·[sample - V^π(s)]
This is the temporal difference learning rule; [sample - V^π(s)] is the TD error. I.e., observe a sample, then move V^π(s) a little bit to make it more consistent with its neighbor V^π(s′).
43 Example: Temporal Difference Learning
States A, B, C, D, E. Observed transitions: B, east, C, -2; then C, east, D, -2.
Assume: γ = 1, α = 1/2.
(A sketch of the update on these transitions follows.)
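A minimal TD(0) sketch run on these two transitions, assuming all values are initialized to zero:

```python
gamma, alpha = 1.0, 0.5
V = {s: 0.0 for s in "ABCDE"}   # assumed initialization: all values start at 0

for s, a, s2, r in [("B", "east", "C", -2), ("C", "east", "D", -2)]:
    sample = r + gamma * V[s2]
    V[s] = V[s] + alpha * (sample - V[s])   # V(s) <- V(s) + alpha * (TD error)

print(V["B"], V["C"])   # -1.0 -1.0
```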
44 Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages. However, if we want to turn values into a (new) policy, we're sunk: picking the action that maximizes expected value requires a one-step lookahead through T and R, i.e., the model.
Idea: learn Q-values, not values. This makes action selection model-free too!
45 Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values.
- Start with V_0(s) = 0, which we know is right
- Given V_k, calculate the depth k+1 values for all states:
  V_{k+1}(s) ← max_a Σ_{s′} T(s,a,s′) [R(s,a,s′) + γ V_k(s′)]
But Q-values are more useful, so compute them instead.
- Start with Q_0(s,a) = 0, which we know is right
- Given Q_k, calculate the depth k+1 q-values for all q-states:
  Q_{k+1}(s,a) ← Σ_{s′} T(s,a,s′) [R(s,a,s′) + γ max_{a′} Q_k(s′,a′)]
46 Q-Learning
Q-Learning: sample-based Q-value iteration. Learn Q(s,a) values as you go:
- Receive a sample (s, a, s′, r)
- Consider your old estimate: Q(s,a)
- Consider your new sample estimate: sample = R(s,a,s′) + γ max_{a′} Q(s′,a′)
- Incorporate the new estimate into a running average: Q(s,a) ← (1-α)·Q(s,a) + α·sample
[Demo: Q-learning gridworld (L10D2)] [Demo: Q-learning crawler (L10D3)]
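A compact tabular Q-learning sketch on a toy environment (the 1-D chain below is a hypothetical stand-in for gridworld):

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical toy environment (not from the slides): a chain of states 0..4;
# each move costs -1, and reaching state 4 pays +10 and ends the episode.
def step(s, a):                  # action a is -1 (left) or +1 (right)
    s2 = max(0, min(4, s + a))
    return s2, (10.0 if s2 == 4 else -1.0)

Q = defaultdict(float)           # Q_0(s,a) = 0, which we know is right
gamma, alpha, actions = 0.9, 0.5, (-1, +1)

for _ in range(2000):            # learn Q(s,a) values as you go
    s = 0
    while s != 4:
        a = random.choice(actions)                            # act (randomly!)
        s2, r = step(s, a)
        sample = r + gamma * max(Q[(s2, b)] for b in actions)  # new estimate
        Q[(s, a)] += alpha * (sample - Q[(s, a)])              # running average
        s = s2

print(max(actions, key=lambda a: Q[(0, a)]))   # learned greedy action at 0: +1
```

Note that the agent here acts purely at random yet still learns the optimal greedy policy, which previews the off-policy property on the next slide.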
47 Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally! This is called off-policy learning.
Technical conditions for convergence:
- Explore enough: eventually try every state-action pair infinitely often
- Decay the learning rate properly: Σ_t α_t = ∞ and Σ_t α_t² < ∞
- α_t = O(1/t) meets these conditions
48 Video of Demo Q-Learning -- Gridworld
49 Video of Demo Q-Learning -- Crawler
50 Active Reinforcement Learning
51 Active Reinforcement Learning
Full reinforcement learning:
- You don't know the transition model
- You don't know the reward function
- You choose the actions now
- Goal: learn the optimal policy / values
In this case, the learner makes choices! Fundamental tradeoff: exploration vs. exploitation. This is NOT offline planning! You actually take actions in the world and find out what happens.
52 Approaches to reinforcement learning
1. Learn the model, solve it, execute the solution
2. Learn values from experiences:
   a. Direct utility estimation: add up rewards obtained from each state onwards
   b. Temporal difference (TD) learning: adjust V(s) to agree better with R(s,a,s′) + γ V(s′); still needs a transition model to make decisions
   c. Q-learning: like TD learning, but learn Q(s,a) instead; sufficient to make decisions without a model!
Important idea: use samples to approximate the weighted sum over next-state values.
53 TD learning and Q-learning
An approximate version of the (fixed-policy) value update gives TD learning:
V_{k+1}(s) ← Σ_{s′} P(s′|s,π(s)) [R(s,π(s),s′) + γ V_k(s′)]
V^π(s) ← (1-α)·V^π(s) + α·[R(s,π(s),s′) + γ V^π(s′)]
An approximate version of the Q iteration update gives Q-learning:
Q_{k+1}(s,a) ← Σ_{s′} P(s′|s,a) [R(s,a,s′) + γ max_{a′} Q_k(s′,a′)]
Q(s,a) ← (1-α)·Q(s,a) + α·[R(s,a,s′) + γ max_{a′} Q(s′,a′)]
We obtain a policy from the learned Q, with no model!
54 Video of Demo Q-Learning Auto Cliff Grid
55 Exploration vs. Exploitation
56 Exploration vs exploitation
Exploration: try new things. Exploitation: do what's best given what you've learned so far.
Key point: pure exploitation often gets stuck in a rut and never finds an optimal policy!
57 Exploration method 1: ε-greedy
ε-greedy exploration: every time step, flip a biased coin.
- With (small) probability ε, act randomly
- With (large) probability 1-ε, act on the current policy
Properties of ε-greedy exploration:
- Every (s,a) pair is tried infinitely often
- Does a lot of stupid things, like jumping off a cliff lots of times to make sure it hurts
- Keeps doing stupid things forever; fix: decay ε towards 0
(A sketch of ε-greedy action selection follows.)
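A minimal sketch of ε-greedy action selection over a tabular Q (as in the Q-learning sketch above):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:                    # explore: act randomly
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])     # exploit: current policy

# Usage inside the Q-learning loop above:  a = epsilon_greedy(Q, s, actions)
```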
58 Video of Demo Q-learning Epsilon-Greedy Crawler
59 Sensible exploration: Bandits
- A: 1000 tries, winnings $900
- B: 100 tries, winnings $90
- C: 5 tries, winnings $4
- D: 100 tries, winnings $0
Which one-armed bandit to try next? Most people would choose C > B > A > D.
Basic intuition: higher mean is better; more uncertainty is better.
Gittins (1979): rank arms by an index that depends only on the arm itself.
60 Exploration Functions
Exploration functions implement this tradeoff. An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g., f(u,n) = u + k/√n.
Regular Q-update: Q(s,a) ← (1-α)·Q(s,a) + α·[R(s,a,s′) + γ max_{a′} Q(s′,a′)]
Modified Q-update: Q(s,a) ← (1-α)·Q(s,a) + α·[R(s,a,s′) + γ max_{a′} f(Q(s′,a′), N(s′,a′))]
Note: this propagates the bonus back to states that lead to unknown states as well!
[Demo: exploration Q-learning crawler exploration function (L11D4)]
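A sketch of the modified update, assuming the bonus form f(u,n) = u + k/√n above (the +1 in the denominator and k = 2.0 are arbitrary implementation choices):

```python
import math
from collections import defaultdict

def f(u, n, k=2.0):
    return u + k / math.sqrt(n + 1)   # optimistic bonus shrinks with visits

def q_update_with_exploration(Q, N, s, a, s2, r, actions, gamma=0.9, alpha=0.5):
    N[(s, a)] += 1                    # count this visit to (s, a)
    # Optimism over next-state Q-values propagates the bonus backwards.
    target = r + gamma * max(f(Q[(s2, b)], N[(s2, b)]) for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q, N = defaultdict(float), defaultdict(int)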
61 Video of Demo Q-learning Exploration Function Crawler
62 Optimality and exploration
[Plot: total reward per trial vs. number of trials, comparing the optimal policy, an exploration function, decaying ε-greedy, and fixed ε-greedy; the gap below optimal is the regret]
63 Regret
Regret measures the total cost of your youthful errors: mistakes made while exploring and learning instead of behaving optimally.
Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal.
64 Approximate Q-Learning
65 Generalizing Across States
Basic Q-learning keeps a table of all Q-values. In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training
- Too many states to hold the Q-tables in memory
Instead, we want to generalize:
- Learn about some small number of training states from experience
- Generalize that experience to new, similar situations
Can we apply some machine learning tools to do this? [demo RL pacman]
66 Example: Pacman
Let's say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state, or even this one! [Images: similar Pacman states]
67 Video of Demo Q-Learning Pacman Tiny Watch All
68 Video of Demo Q-Learning Pacman Tiny Silent Train
69 Video of Demo Q-Learning Pacman Tricky Watch All
70 Feature-Based Representations
Solution: describe a state using a vector of features (properties). Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
Example features:
- Distance to closest ghost (f_GST)
- Distance to closest dot
- Number of ghosts
- 1 / (distance to closest dot) (f_DOT)
- Is Pacman in a tunnel? (0/1)
- etc.
Can also describe a q-state (s, a) with features (e.g., action moves closer to food).
71 Linear Value Functions
We can express V and Q (approximately) as weighted linear functions of feature values:
V_w(s) = w_1·f_1(s) + w_2·f_2(s) + … + w_n·f_n(s)
Q_w(s,a) = w_1·f_1(s,a) + w_2·f_2(s,a) + … + w_n·f_n(s,a)
Important: depending on the features used, the best possible approximation may be terrible! But in practice we can compress a value function for chess (10^43 states) down to about 30 weights and get decent play!!!
72 Updating a linear value function
The original Q-learning rule tries to reduce the prediction error at (s,a):
Q(s,a) ← Q(s,a) + α·[R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)]
Instead, we update the weights to try to reduce the error at (s,a):
w_i ← w_i + α·[R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)]·∂Q_w(s,a)/∂w_i
    = w_i + α·[R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)]·f_i(s,a)
Qualitative justification:
- Pleasant surprise: increase weights on positive features, decrease on negative ones
- Unpleasant surprise: decrease weights on positive features, increase on negative ones
(A sketch of this update follows.)
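A sketch of the weight update for a linear Q-function; the feature extractor below is a hypothetical placeholder (real features would depend on s and a, as on the previous slides):

```python
def features(s, a):
    # Placeholder feature vector, e.g. [bias, 1/(distance to closest dot)].
    return [1.0, 0.5]

def q_value(w, s, a):
    # Q_w(s,a) = w_1 f_1(s,a) + ... + w_n f_n(s,a)
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def update_weights(w, s, a, r, s2, actions, gamma=0.9, alpha=0.01):
    target = r + gamma * max(q_value(w, s2, b) for b in actions)
    error = target - q_value(w, s, a)          # the TD error ("surprise")
    # w_i <- w_i + alpha * error * f_i(s,a): each weight moves in proportion
    # to how much its feature contributed to the prediction.
    return [wi + alpha * error * fi for wi, fi in zip(w, features(s, a))]
```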
73 Example: Q-Pacman [Demo: approximate Q-learning pacman (L11D10)]
74 Video of Demo Approximate Q-Learning -- Pacman
75 Convergence*
Let V_L be the closest linear approximation to V*.
- TD learning with a linear function approximator converges to some V that is pretty close to V_L
- Q-learning with a linear function approximator may diverge
- With much more complicated update rules, stronger convergence results can be proved, even for nonlinear function approximators such as neural nets
76 Nonlinear function approximators
We can still use the gradient-based update for any Q_w:
w_i ← w_i + α·[R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)]·∂Q_w(s,a)/∂w_i
Neural network error back-propagation already does this! Maybe we can get much better V or Q approximators using a complicated neural net instead of a linear function.
77 Backgammon
78 TDGammon
4-ply lookahead using V(s) trained from 1,500,000 games of self-play. 3 hidden layers, ~100 units each. Input: contents of each location plus several handcrafted features.
Experimental results: plays approximately at parity with the world champion; led to radical changes in the way humans play backgammon.
79 DeepMind DQN
Used a deep learning network to represent Q:
- Input is the last 4 images (84x84 pixel values) plus the score
- 3 hidden layers: 16 units, each taking input from an 8x8 region, replicated 400 times; 32 units, each taking input from a 4x4 region of layer 1, replicated 100 times; 256 units fully connected to all units from layer 2
Trained on 7 Atari video games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders.
81 DeepMind (DQN) results
The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.
82 RL and dopamine
Dopamine signal generated by parts of the striatum. Encodes the predictive error in the value function (as in TD learning).
83 RL and Flappy Bird
State space:
- Discretized vertical distance from the lower pipe
- Discretized horizontal distance from the next pair of pipes
- Life: dead or living
Actions: click, or do nothing.
Rewards: +1 if Flappy Bird is still alive; a large negative reward if Flappy Bird is dead.
Trained with 6-7 hours of Q-learning.