CS 188: Artificial Intelligence


1 CS 188: Artificial Intelligence Reinforcement Learning Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein and Pieter Abbeel]

2 Reinforcement Learning

3 Double Bandits

4 Double-Bandit MDP Actions: Blue, Red. States: Win (W), Lose (L). No discount, 10 time steps. Both states have the same value. [Figure: two-state MDP diagram showing the $1 payoff of Blue and the $2 / $0 payoffs of Red with probability 0.75 / 0.25]

5 Offline Planning Solving MDPs is offline planning: you determine all quantities through computation. You need to know the details of the MDP. You do not actually play the game! No discount, 10 time steps, both states have the same value. Play Red: value 15. Play Blue: value 10 ($1 per step for 10 steps). [Figure: double-bandit MDP diagram]

6 Let's Play! $2 $2 $0 $2 $2 $2 $2 $0 $0 $0

7 Online Planning Rules changed! Red's win chance is different. [Figure: the same double-bandit MDP, but Red's transition probabilities are now unknown (shown as ??)]

8 Let's Play! $0 $0 $0 $2 $0 $2 $0 $0 $0 $0

9 What Just Happened? That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation: you needed to actually act to figure it out. Important ideas in reinforcement learning that came up: Exploration: you have to try unknown actions to get information. Exploitation: eventually, you have to use what you know. Regret: even if you learn intelligently, you make mistakes. Sampling: because of chance, you have to try things repeatedly. Difficulty: learning can be much harder than solving a known MDP.

10 Reinforcement Learning Still assume a Markov decision process (MDP): a set of states s ∈ S, a set of actions (per state) A, a model T(s,a,s′), a reward function R(s,a,s′). Still looking for a policy π(s). New twist: we don't know T or R, i.e. we don't know which states are good or what the actions do. Must actually try actions and states out to learn.

11 Reinforcement Learning [Diagram: the agent sends actions a to the environment; the environment returns a state s and a reward r] Basic idea: learn how to maximize expected rewards based on observed samples of transitions.

12 Example: Learning to Walk Initial After Learning [1K Trials] [Kohl and Stone, ICRA 2004]

13 Example: Learning to Walk [Kohl and Stone, ICRA 2004] Initial [Video: AIBO WALK initial]

14 Example: Learning to Walk [Kohl and Stone, ICRA 2004] Finished [Video: AIBO WALK finished]

15 Example: Sidewinding [Andrew Ng] [Video: SNAKE climbstep+sidewinding]

16 Example: Toddler Robot [Tedrake, Zhang and Seung, 2005] [Video: TODDLER 40s]

17 The Crawler! [Demo: Crawler Bot (L10D1)] [You, in Project 3]

18 Video of Demo Crawler Bot

19 DeepMind Atari (Two Minute Lectures)

20 Reinforcement Learning Still assume a Markov decision process (MDP): a set of states s ∈ S, a set of actions (per state) A, a model P(s′|a,s), a reward function R(s,a,s′). Still looking for a policy π(s). New twist: we don't know P or R, i.e. we don't know which states are good or what the actions do. Must actually try actions and explore new states -- to boldly go where no Pacman agent has been before.

21 Offline (MDPs) vs. Online (RL) Offline Solution Online Learning

22 Model-Based Learning

23 Model-Based Learning Model-Based Idea: Learn an approximate model based on experiences, then solve for values as if the learned model were correct. Step 1: Learn the empirical MDP model: count outcomes s′ for each (s, a), normalize to give an estimate of P(s′|s,a), and discover each R(s,a,s′) when we experience the transition. Step 2: Solve the learned MDP, for example using value or policy iteration, as before. (A small code sketch follows.)
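To make Step 1 concrete, here is a minimal Python sketch (not from the slides) that counts observed transitions and normalizes them into an estimated model; the (s, a, s′, r) tuple format for experience is an assumption for illustration.

```python
from collections import defaultdict

def learn_empirical_mdp(transitions):
    """Estimate P(s'|s,a) and R(s,a,s') from observed (s, a, s_next, r) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = number of times seen
    rewards = {}                                     # rewards[(s, a, s')] = r (assumed deterministic)
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r
    T = {}                                           # T[(s, a)][s'] = estimated probability
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}
    return T, rewards

# Example: the four episodes from the next slide, flattened into transitions
transitions = [
    ('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10),
    ('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10),
    ('E', 'north', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10),
    ('E', 'north', 'C', -1), ('C', 'east', 'A', -1), ('A', 'exit', 'x', -10),
]
T, R = learn_empirical_mdp(transitions)
print(T[('C', 'east')])   # {'D': 0.75, 'A': 0.25}
```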

24 Example: Model-Based Learning Input policy π over states A, B, C, D, E. Assume: γ = 1. Observed Episodes (Training): Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10. Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10. Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10. Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10. Learned Model: T(s,a,s′), i.e. P(s′|s,a): P(C|B, east) = 1.00, P(D|C, east) = 0.75, P(A|C, east) = 0.25. R(s,a,s′): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10.

25 Pros and cons Pro: makes efficient use of experiences. Con: may not scale to large state spaces; learns the model one state-action pair at a time (but this is fixable); cannot solve the MDP for a very large S.

26 Model-Free Learning

27 Basic idea of model-free methods To approximate expectations with respect to a distribution, you can either estimate the distribution from samples and then compute the expectation, or estimate the expectation from samples directly.

28 Example: Expected Age Goal: compute the expected age of CS188 students. Known P(A): E[A] = Σ_a P(a)·a = 0.35 × 20 + … Without P(A), instead collect samples [a_1, a_2, …, a_N]. Unknown P(A), model-based: P̂(a) = N_a / N, then E[A] ≈ Σ_a P̂(a)·a. Why does this work? Because eventually you learn the right model. Unknown P(A), model-free: E[A] ≈ (1/N) Σ_i a_i. Why does this work? Because samples appear with the right frequencies.
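A tiny Python sketch (not part of the slides) contrasting the two estimators on the same samples; the age data here is made up for illustration. For a single expectation the two routes coincide; the point is that the model-based route also produces a reusable P̂(a).

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical samples of student ages drawn from some unknown distribution
samples = [random.choice([19, 20, 20, 21, 21, 21, 22, 23, 25, 30]) for _ in range(1000)]

# Model-based: estimate P(a) from counts, then take the expectation under the model
counts = Counter(samples)
N = len(samples)
P_hat = {a: n / N for a, n in counts.items()}
expected_age_model_based = sum(P_hat[a] * a for a in P_hat)

# Model-free: average the samples directly
expected_age_model_free = sum(samples) / N

print(expected_age_model_based, expected_age_model_free)
# Essentially the same number here; the model-based route additionally yields P_hat,
# which could be reused for other computations.
```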

29 Passive Reinforcement Learning

30 Passive Reinforcement Learning Simplified task: policy evaluation. Input: a fixed policy π(s). You don't know the transitions P(s′|s,a). You don't know the rewards R(s,a,s′). Goal: learn the state values V^π(s). In this case: the learner is along for the ride, with no choice about what actions to take: just do it. This is NOT offline planning! The agent takes actions in the world.

31 Direct utility estimation Goal: estimate V^π(s), i.e., the expected total discounted reward from s onwards. Idea: use the actual sum of discounted rewards from s, averaged over multiple trials and visits to s. This is called direct utility estimation.

32 Example: Direct Evaluation Input policy π over states A, B, C, D, E. Assume: γ = 1. Observed Episodes (Training): Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10. Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10. Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10. Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10. Output Values (average of observed returns): A = -10, B = +8, C = +4, D = +10, E = -2.
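A minimal Python sketch (my own, not from the slides) of direct utility estimation on these four episodes: accumulate the return observed from each state visit, then average.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted return from every visit to every state."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:                      # episode = list of (s, a, s_next, r)
        for i, (s, _, _, _) in enumerate(episode):
            # Return from this visit onward: sum of discounted future rewards
            G = sum(gamma ** k * r for k, (_, _, _, r) in enumerate(episode[i:]))
            totals[s] += G
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}

episodes = [
    [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('E', 'north', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
    [('E', 'north', 'C', -1), ('C', 'east', 'A', -1), ('A', 'exit', 'x', -10)],
]
print(direct_evaluation(episodes))  # {'B': 8.0, 'C': 4.0, 'D': 10.0, 'E': -2.0, 'A': -10.0}
```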

33 Problems with Direct Evaluation What's good about direct evaluation? It's easy to understand, it doesn't require any knowledge of P(s′|s,a) or R(s,a,s′), and it converges to the right answer in the limit. What's bad about it? Each state must be learned separately (fixable), and it ignores information about state connections, so it takes a long time to learn. In the example above, B = +8 and E = -2: if B and E both go to C under this policy, how can their values be different?

34 Why Not Use Policy Evaluation? Simplified Bellman updates calculate V^π for a fixed policy: V^π_{k+1}(s) ← Σ_{s′} T(s, π(s), s′) [R(s, π(s), s′) + γ V^π_k(s′)]. Each round, replace V with a one-step-look-ahead layer over V. This approach fully exploits the connections between the states. Unfortunately, we need T and R to do it! Key question: how can we do this update to V without knowing T and R? In other words, how do we take a weighted average without knowing the weights?

35 Sample-Based Policy Evaluation? We want to improve our estimate of V^π by computing these averages. Idea: take samples of outcomes s′ (by doing the action!) and average over them (s′_1, s′_2, s′_3, …). Almost! But we can't rewind time to get sample after sample from state s.

36 Temporal difference (TD) learning

37 TD as approximate Bellman update Policy evaluation (version 1) improves the estimate of V^π by computing a Bellman update, i.e., an expectation over next-state values: V_{k+1}(s) ← Σ_{s′} P(s′|π(s),s) [R(s, π(s), s′) + γ V_k(s′)]. Idea 1: Use actual samples to estimate the expectation: sample_1 = R(s, π(s), s′_1) + γ V_k(s′_1); sample_2 = R(s, π(s), s′_2) + γ V_k(s′_2); …; sample_N = R(s, π(s), s′_N) + γ V_k(s′_N); V_{k+1}(s) ← (1/N) Σ_i sample_i.

38 TD as approximate Bellman update Idea 2: Update the value of s after each transition (s, a, s′, r): Update V^π([3,1]) based on R([3,1],up,[3,2]) and γV^π([3,2]). Update V^π([3,2]) based on R([3,2],up,[3,3]) and γV^π([3,3]). Update V^π([3,3]) based on R([3,3],right,[4,3]) and γV^π([4,3]).

39 TD as approximate Bellman update Idea 3: Update values by maintaining a running average

40 Running averages How do you compute the average of 1, 4, 7? Method 1: add them up and divide by N: sum = 12, average = 12/N = 12/3 = 4. Method 2: keep a running average μ_n and a running count n: n=0: μ_0 = 0; n=1: μ_1 = (0·μ_0 + x_1)/1 = (0 + 1)/1 = 1; n=2: μ_2 = (1·μ_1 + x_2)/2 = (1 + 4)/2 = 2.5; n=3: μ_3 = (2·μ_2 + x_3)/3 = (5 + 7)/3 = 4. General formula: μ_n = ((n-1)·μ_{n-1} + x_n)/n = [(n-1)/n]·μ_{n-1} + [1/n]·x_n (a weighted average of the old mean and the new sample).

41 Running averages contd. What if we use a weighted average with a fixed weight α? μ_n = (1-α)·μ_{n-1} + α·x_n. n=1: μ_1 = α·x_1; n=2: μ_2 = (1-α)·μ_1 + α·x_2 = (1-α)·α·x_1 + α·x_2; n=3: μ_3 = (1-α)·μ_2 + α·x_3 = (1-α)²·α·x_1 + (1-α)·α·x_2 + α·x_3; n=4: μ_4 = (1-α)·μ_3 + α·x_4 = (1-α)³·α·x_1 + (1-α)²·α·x_2 + (1-α)·α·x_3 + α·x_4. I.e., exponential forgetting of old values.
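Both running-average schemes as a short Python sketch (illustration only, using the same samples 1, 4, 7):

```python
def exact_running_average(xs):
    """Method 2: exact running mean, weight 1/n on the newest sample."""
    mu, n = 0.0, 0
    for x in xs:
        n += 1
        mu = ((n - 1) * mu + x) / n
    return mu

def exponential_running_average(xs, alpha=0.5):
    """Fixed weight alpha: old samples are exponentially forgotten."""
    mu = 0.0
    for x in xs:
        mu = (1 - alpha) * mu + alpha * x
    return mu

print(exact_running_average([1, 4, 7]))        # 4.0
print(exponential_running_average([1, 4, 7]))  # 4.625 (recent samples weigh more)
```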

42 TD as approximate Bellman update Idea 3: Update values by maintaining a running average: sample = R(s, π(s), s′) + γ V^π(s′); V^π(s) ← (1-α) V^π(s) + α·sample, equivalently V^π(s) ← V^π(s) + α·[sample - V^π(s)]. This is the temporal difference learning rule; [sample - V^π(s)] is the TD error. I.e., observe a sample, move V^π(s) a little bit to make it more consistent with its neighbor V^π(s′).
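A minimal Python sketch of this TD(0) update for a fixed policy (my own illustration; the environment interface `reset`/`step` and the policy function `pi` are hypothetical):

```python
from collections import defaultdict

def td_policy_evaluation(env, pi, episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate V^pi by nudging V(s) toward each observed sample."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)
            s_next, r, done = env.step(a)      # hypothetical env interface
            sample = r + gamma * V[s_next]     # one-sample estimate of the Bellman backup
            V[s] += alpha * (sample - V[s])    # move V(s) a little toward the sample (TD error)
            s = s_next
    return V
```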

43 Example: Temporal Difference Learning States: A, B, C, D, E. Observed transitions: B, east, C, -2; C, east, D, -2. Assume: γ = 1, α = 1/2.

44 Problems with TD Value Learning TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages. However, if we want to turn values into a (new) policy, we're sunk: π(s) = argmax_a Σ_{s′} T(s,a,s′) [R(s,a,s′) + γ V(s′)] requires knowing T and R. Idea: learn Q-values, not values. This makes action selection model-free too!

45 Detour: Q-Value Iteration Value iteration: find successive (depth-limited) values. Start with V_0(s) = 0, which we know is right. Given V_k, calculate the depth k+1 values for all states: V_{k+1}(s) ← max_a Σ_{s′} T(s,a,s′) [R(s,a,s′) + γ V_k(s′)]. But Q-values are more useful, so compute them instead. Start with Q_0(s,a) = 0, which we know is right. Given Q_k, calculate the depth k+1 q-values for all q-states: Q_{k+1}(s,a) ← Σ_{s′} T(s,a,s′) [R(s,a,s′) + γ max_{a′} Q_k(s′,a′)].
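A compact Python sketch of Q-value iteration given a known model (illustration only; `states`, `actions`, `T`, and `R` are assumed to be supplied, with `T[(s,a)]` mapping s′ to probability and `R[(s,a,s′)]` giving the reward):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Compute Q_k for increasing k by repeatedly applying the Q backup."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(iterations):
        Q_new = {}
        for s in states:
            for a in actions(s):
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s_next)] +
                         gamma * max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0))
                    for s_next, p in T[(s, a)].items()
                )
        Q = Q_new
    return Q
```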

46 Q-Learning Q-Learning: sample-based Q-value iteration. Learn Q(s,a) values as you go: Receive a sample (s, a, s′, r). Consider your old estimate: Q(s,a). Consider your new sample estimate: sample = R(s,a,s′) + γ max_{a′} Q(s′,a′). Incorporate the new estimate into a running average: Q(s,a) ← (1-α) Q(s,a) + α·sample. [Demo: Q-learning gridworld (L10D2)] [Demo: Q-learning crawler (L10D3)]
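The tabular Q-learning loop as a short Python sketch (mine, not project code; the `env` object with `reset`/`step`/`actions` and the ε-greedy helper are assumptions for illustration):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                              # Q[(s, a)], defaults to 0

    def choose_action(s):
        # epsilon-greedy over the actions available in s
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)               # hypothetical env interface
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            sample = r + gamma * best_next              # new sample estimate
            Q[(s, a)] += alpha * (sample - Q[(s, a)])   # running-average update
            s = s_next
    return Q
```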

47 Q-Learning Properties Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally! This is called off-policy learning. Technical conditions for convergence: Explore enough: eventually try every state-action pair infinitely often. Decay the learning rate properly: Σ_t α_t = ∞ and Σ_t α_t² < ∞; α_t = O(1/t) meets these conditions.

48 Video of Demo Q-Learning -- Gridworld

49 Video of Demo Q-Learning -- Crawler

50 Active Reinforcement Learning

51 Active Reinforcement Learning Full reinforcement learning: You don't know the transition model. You don't know the reward function. You choose the actions now. Goal: learn the optimal policy / values. In this case: the learner makes choices! Fundamental tradeoff: exploration vs. exploitation. This is NOT offline planning! You actually take actions in the world and find out what happens.

52 Approaches to reinforcement learning 1. Learn the model, solve it, execute the solution 2. Learn values from experiences a. Direct utility estimation: add up rewards obtained from each state onwards b. Temporal difference (TD) learning: adjust V(s) to agree better with R(s,a,s′) + γV(s′); still need a transition model to make decisions c. Q-learning: like TD learning, but learn Q(s,a) instead; sufficient to make decisions without a model! Important idea: use samples to approximate the weighted sum over next-state values

53 TD learning and Q-learning An approximate version of the policy evaluation update gives TD learning: V_{k+1}(s) ← Σ_{s′} P(s′|π(s),s) [R(s, π(s), s′) + γ V_k(s′)] becomes V^π(s) ← (1-α) V^π(s) + α [R(s, π(s), s′) + γ V^π(s′)]. An approximate version of the Q iteration update gives Q-learning: Q_{k+1}(s,a) ← Σ_{s′} P(s′|a,s) [R(s,a,s′) + γ max_{a′} Q_k(s′,a′)] becomes Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s′) + γ max_{a′} Q(s′,a′)]. We obtain a policy from the learned Q, with no model!

54 Video of Demo Q-Learning Auto Cliff Grid

55 Exploration vs. Exploitation

56 Exploration vs exploitation Exploration: try new things. Exploitation: do what's best given what you've learned so far. Key point: pure exploitation often gets stuck in a rut and never finds an optimal policy!

57 Exploration method 1: ε-greedy ε-greedy exploration: every time step, flip a biased coin. With (small) probability ε, act randomly; with (large) probability 1-ε, act on the current policy. Properties of ε-greedy exploration: every s,a pair is tried infinitely often, but it does a lot of stupid things (jumping off a cliff lots of times to make sure it hurts) and keeps doing stupid things forever. Fix: decay ε towards 0.
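The decay idea from the last point, as a tiny Python sketch (the schedule constants are illustrative, not from the slides):

```python
import random

def epsilon_greedy(Q, s, actions, episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Epsilon decays toward 0 over episodes, shifting from exploration to exploitation."""
    epsilon = max(eps_min, eps_start * decay ** episode)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```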

58 Video of Demo Q-learning Epsilon-Greedy Crawler

59 Sensible exploration: Bandits Arm A: 1000 tries, winnings 900. Arm B: 100 tries, winnings 90. Arm C: 5 tries, winnings 4. Arm D: 100 tries, winnings 0. Which one-armed bandit to try next? Most people would choose C > B > A > D. Basic intuition: higher mean is better; more uncertainty is better. Gittins (1979): rank arms by an index that depends only on the arm itself.

60 Exploration Functions Exploration functions implement this tradeoff: take a value estimate u and a visit count n, and return an optimistic utility, e.g., f(u,n) = u + k/n. Regular Q-update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s′) + γ max_{a′} Q(s′,a′)]. Modified Q-update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s′) + γ max_{a′} f(Q(s′,a′), N(s′,a′))]. Note: this propagates the bonus back to states that lead to unknown states as well! [Demo: exploration Q-learning crawler exploration function (L11D4)]
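A sketch of the modified update (my own; the visit-count table N and constant k follow the slide's notation, everything else is illustrative, including the +1 guard against division by zero for unvisited pairs):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]
N = defaultdict(int)     # N[(s, a)]: visit counts

def exploration_bonus(u, n, k=1.0):
    """Optimistic utility: raw estimate plus a bonus that shrinks with visits."""
    return u + k / (n + 1)          # +1 is an illustrative guard for n = 0

def modified_q_update(s, a, r, s_next, next_actions, alpha=0.1, gamma=0.99):
    N[(s, a)] += 1
    target = r + gamma * max(
        exploration_bonus(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in next_actions
    )
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```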

61 Video of Demo Q-learning Exploration Function Crawler

62 Optimality and exploration [Plot: total reward per trial vs. number of trials; curves for the optimal policy, an exploration function, decaying ε-greedy, and fixed ε-greedy; the gap below optimal is the regret]

63 Regret Regret measures the total cost of your youthful errors: mistakes made while exploring and learning instead of behaving optimally. Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal.

64 Approximate Q-Learning

65 Generalizing Across States Basic Q-Learning keeps a table of all Q-values In realistic situations, we cannot possibly learn about every single state! Too many states to visit them all in training Too many states to hold the Q-tables in memory Instead, we want to generalize: Learn about some small number of training states from experience Generalize that experience to new, similar situations Can we apply some machine learning tools to do this? [demo RL pacman]

66 Example: Pacman Let's say we discover through experience that this state is bad: In naïve Q-learning, we know nothing about this state: Or even this one!

67 Video of Demo Q-Learning Pacman Tiny Watch All

68 Video of Demo Q-Learning Pacman Tiny Silent Train

69 Video of Demo Q-Learning Pacman Tricky Watch All

70 Feature-Based Representations Solution: describe a state using a vector of features (properties). Features are functions from states to real numbers (often 0/1) that capture important properties of the state. Example features: distance to closest ghost (f_GST), distance to closest dot, number of ghosts, 1 / (distance to closest dot) (f_DOT), is Pacman in a tunnel? (0/1), etc. Can also describe a q-state (s, a) with features (e.g. action moves closer to food).

71 Linear Value Functions We can express V and Q (approximately) as weighted linear functions of feature values: V_w(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s); Q_w(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a). Important: depending on the features used, the best possible approximation may be terrible! But in practice we can compress a value function for chess (10^43 states) down to about 30 weights and get decent play!!!

72 Updating a linear value function The original Q-learning rule tries to reduce the prediction error at s,a: Q(s,a) ← Q(s,a) + α [R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)]. Instead, we update the weights to try to reduce the error at s,a: w_i ← w_i + α [R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)] ∂Q_w(s,a)/∂w_i = w_i + α [R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)] f_i(s,a). Qualitative justification: Pleasant surprise: increase weights on positive features, decrease on negative ones. Unpleasant surprise: decrease weights on positive features, increase on negative ones.
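A short sketch of this feature-weight update (my own illustration; the `features(s, a)` helper returning a dict of feature values is an assumption, in the spirit of the Project 3 feature extractor):

```python
from collections import defaultdict

weights = defaultdict(float)

def q_value(features, s, a):
    """Linear Q: weighted sum of feature values."""
    return sum(weights[f] * v for f, v in features(s, a).items())

def approximate_q_update(features, s, a, r, s_next, next_actions, alpha=0.05, gamma=0.99):
    best_next = max((q_value(features, s_next, a2) for a2 in next_actions), default=0.0)
    difference = (r + gamma * best_next) - q_value(features, s, a)   # TD error at (s, a)
    for f, v in features(s, a).items():
        weights[f] += alpha * difference * v                          # since dQ_w/dw_i = f_i(s, a)
    return difference
```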

73 Example: Q-Pacman [Demo: approximate Q- learning pacman (L11D10)]

74 Video of Demo Approximate Q-Learning -- Pacman

75 Convergence* Let V_L be the closest linear approximation to V*. TD learning with a linear function approximator converges to some V that is pretty close to V_L. Q-learning with a linear function approximator may diverge. With much more complicated update rules, stronger convergence results can be proved even for nonlinear function approximators such as neural nets.

76 Nonlinear function approximators We can still use the gradient-based update for any Q_w: w_i ← w_i + α [R(s,a,s′) + γ max_{a′} Q(s′,a′) - Q(s,a)] ∂Q_w(s,a)/∂w_i. Neural network error back-propagation already does this! Maybe we can get much better V or Q approximators using a complicated neural net instead of a linear function.

77 Backgammon

78 TDGammon 4-ply lookahead using V(s) trained from 1,500,000 games of self-play. 3 hidden layers, ~100 units each. Input: contents of each location plus several handcrafted features. Experimental results: plays approximately at parity with the world champion; led to radical changes in the way humans play backgammon.

79 DeepMind DQN Used a deep learning network to represent Q: input is the last 4 images (84x84 pixel values) plus the score. 3 hidden layers: 16 units, each taking input from an 8x8 region, replicated 400 times; 32 units, each taking input from a 4x4 region of layer 1, replicated 100 times; 256 units fully connected to all units from layer 2. Trained on 7 Atari video games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders.


81 DeepMind (DQN) results "The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales."

82 RL and dopamine Dopamine signal generated by parts of the striatum. Encodes the predictive error in the value function (as in TD learning).

83 RL and Flappy Bird State space: discretized vertical distance from the lower pipe; discretized horizontal distance from the next pair of pipes; life: dead or living. Actions: click, or do nothing. Rewards: +1 if Flappy Bird is still alive; a large negative reward if Flappy Bird is dead. 6-7 hours of Q-learning.
