Reinforcement Learning
1 Reinforcement Learning: Model-Based Reinforcement Learning
Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning
Vien Ngo, MLR, University of Stuttgart
2 Outline
- Model-Based RL
- Exploration/Exploitation
- PAC-MDP: E3, RMAX
- Bayes-optimal: Bayesian model-based RL
3 RL Approaches
4 Model-based RL
Model learning: given data D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{H}, estimate P(s' | a, s) and P(r | a, s).
- discrete state-action: \hat{P}(s' | a, s) = \#(s, a, s') / \#(a, s)
- continuous state-action: \hat{P}(s' | a, s) = N(s' | \phi(s, a)^\top \beta, \Sigma); estimate the parameters \beta (and perhaps \Sigma) as for regression (including non-linear features, regularization, cross-validation!)
Planning:
- discrete state-action: Dynamic Programming with the estimated model (see the sketch below)
- continuous state-action: Differential Dynamic Programming, Planning-by-Inference, etc.
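To make the model-learning and planning steps concrete, here is a minimal Python sketch of the discrete case: count-based estimates \hat{P}(s' | s, a) and \hat{R}(s, a) from data, followed by value iteration on the estimated model. The function names and the (s, a, r, s') tuple format are illustrative, not from the slides.

```python
import numpy as np

def estimate_model(data, n_states, n_actions):
    """Count-based model estimate from (s, a, r, s_next) tuples:
    P_hat(s'|s,a) = #(s,a,s') / #(s,a), R_hat(s,a) = mean observed reward."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2)                     # #(s,a)
    seen = n_sa > 0
    P_hat = np.zeros_like(counts)
    P_hat[seen] = counts[seen] / n_sa[seen, None]
    R_hat = np.zeros((n_states, n_actions))
    R_hat[seen] = reward_sum[seen] / n_sa[seen]
    return P_hat, R_hat

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Dynamic programming (value iteration) on the estimated model."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V                     # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)        # values, greedy policy
        V = V_new
```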
5 Exploration vs. Exploitation
6 Exploration: Motivation
Start: many potential paths.
7 Exploration: Motivation
Continue using this path? Try out alternatives?
8 Exploration: Motivation (figure only)
9 Exploration: Motivation (figure only)
10-11 Reinforcement learning
In reinforcement learning (RL), the agent starts to act without a model of the environment. The agent has to learn from its experience what to do in order to fulfill tasks and achieve high rewards.
RL algorithms we have seen thus far: Q-learning, TD-learning, policy gradient, ...
Note the difference to the problem of adapting the behavior based on a given model (also called planning / solving an MDP / calculating optimal state and action values). This is a computational subproblem in model-based RL.
Planning algorithms we have seen thus far: value iteration, policy iteration.
12 Exploration / Exploitation
In contrast to supervised learning, in RL the data used for learning depend on the agent. Two different types of behavior:
- exploration: behave with the goal of learning as much as possible
- exploitation: behave with the goal of getting as much reward as possible
Challenge in exploration: which actions will lead to the most important learning progress with respect to the goal? Exploration as fundamental intelligent behavior.
13 Recall: Markov Decision Processes
P(s_{0:T}, a_{0:T}, r_{0:T}; \pi) = P(s_0)\, P(a_0 | s_0; \pi)\, P(r_0 | a_0, s_0) \prod_{t=1}^{T} P(s_t | a_{t-1}, s_{t-1})\, P(a_t | s_t; \pi)\, P(r_t | a_t, s_t)
- world's initial state distribution P(s_0)
- world's transition probabilities P(s_{t+1} | a_t, s_t)
- world's reward probabilities P(r_t | a_t, s_t)
- agent's policy \pi(a_t | s_t) (or deterministic a_t = \pi(s_t))
- discount parameter \gamma for future rewards
Two different sources of uncertainty: the world itself (not controlled by the agent) vs. the policy (controlled by the agent).
14-15 Exploration-exploitation tradeoff
Goal of the reinforcement learning agent: maximize future rewards E[\sum_{t=0}^{\infty} \gamma^t r_t | s_0; \pi].
However, the agent does not know the transition parameters P(s_{t+1} | a_t, s_t) and reward parameters P(r_t | a_t, s_t) of the MDP. Rather, the agent needs to learn from its experience s_0, a_0, r_0, s_1, a_1, r_1, ... which actions will lead to high rewards.
Exploration-exploitation tradeoff: Which policy \pi(a_t | s_t) for action selection shall the agent follow so that it does not miss the high-reward states, but does not spend too much time in low-reward states, either?
- Exploitation: Prefer actions a_t which have led to reward before?
- Exploration: Or rather take actions to learn more about the unknown MDP parameters and potentially find states with higher reward?
16 Notion of Optimality
17 Sample Complexity
Let M be an MDP with N states, K actions, discount factor \gamma \in [0, 1), and maximal reward R_max > 0. Let A be an algorithm (that is, a reinforcement learning agent) that acts in the environment, resulting in s_0, a_0, r_0, s_1, a_1, r_1, ....
Let V^A_{t,M} = E[\sum_{i=0}^{\infty} \gamma^i r_{t+i} | s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t]. V^* is the value function of the optimal policy.
Define an accuracy threshold \epsilon: \|\hat{V} - V^*\| \le \epsilon.
18 Definition (sample complexity)
Let \epsilon > 0 be a prescribed accuracy and \delta > 0 an allowed probability of failure. The expression \eta(\epsilon, \delta, N, K, \gamma, R_max) is a sample complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 - \delta, the number of timesteps such that V^A_{t,M} < V^*(s_t) - \epsilon is at most \eta(\epsilon, \delta, N, K, \gamma, R_max). (Kakade, 2003)
19 Efficient exploration
An algorithm with sample complexity that is polynomial in 1/\epsilon, \log(1/\delta), N, K, 1/(1-\gamma), R_max is called PAC-MDP (probably approximately correct in MDPs).
20 Exploration strategies
21 Exploration strategies
The exploration strategy is reflected in the policy \pi(s). In the following, assume we have estimates \hat{Q}(s, a).
- greedy (only exploit): \pi(s) = argmax_a \hat{Q}(s, a). Problem: the learned model is not the same as the true environment; without exploration, the agent is likely to miss high rewards.
- random: choose each action a with probability 1/|A|. Problem: ignores value estimates and thus rewards.
22 Exploration strategies (continued)
- \epsilon-greedy: \pi(s) = argmax_a \hat{Q}(s, a) with probability 1 - \epsilon, a random action with probability \epsilon. The most popular method. Converges to the optimal value function with probability 1 (all paths will be visited sooner or later) if the exploration rate diminishes according to an appropriate schedule. Problem: sample complexity exponential in the number of states.
- Boltzmann: choose action a with probability \exp(\hat{Q}(s,a)/T) / \sum_{a'} \exp(\hat{Q}(s,a')/T); the temperature T controls the amount of exploration. Problem again: sample complexity exponential in the number of states.
(A sketch of both selection rules follows below.)
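A minimal sketch of the two selection rules above, assuming a tabular estimate Q of shape (n_states, n_actions); the helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """pi(s) = argmax_a Q[s,a] with prob. 1 - epsilon, random action with prob. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, temperature=1.0):
    """Sample a with prob. exp(Q[s,a]/T) / sum_a' exp(Q[s,a']/T)."""
    logits = Q[s] / temperature
    logits = logits - logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    p = p / p.sum()
    return int(rng.choice(Q.shape[1], p=p))
```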
23 Exploration strategies (continued)
Other heuristics for exploration:
- minimize variance of action value estimates
- optimistic initial values ("optimism in the face of uncertainty")
- state bonuses: frequency, recency, error, etc.
Problem again: sample complexity exponential in the number of states.
Bayesian RL: optimal exploration strategy
- distribution over MDP models (i.e., over the parameters of the MDP); the posterior distribution is updated after each new observation, and the exploration strategy minimizes the uncertainty of the parameters (see the Dirichlet sketch below)
- Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. the prior distribution over MDPs)
- only tractable for very simple problems
E3 and RMAX: principled approaches to the exploration-exploitation tradeoff with polynomial sample complexity.
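The model-learning half of Bayesian RL is simple even when the planning half is intractable. A minimal sketch, assuming independent Dirichlet priors over each transition distribution P(s' | s, a) (a standard choice for discrete MDPs; the class name is illustrative):

```python
import numpy as np

class DirichletTransitionModel:
    """Posterior over P(s'|s,a): one Dirichlet per (s,a) pair.
    Only the belief update is shown; Bayes-optimal planning over these
    beliefs is what makes Bayesian RL intractable in general."""

    def __init__(self, n_states, n_actions, prior=1.0):
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        self.alpha[s, a, s_next] += 1.0   # conjugate posterior update

    def mean(self):
        """Posterior mean transition model."""
        return self.alpha / self.alpha.sum(axis=2, keepdims=True)

    def sample(self):
        """One MDP drawn from the posterior (e.g. for Thompson-style RL)."""
        return np.apply_along_axis(np.random.dirichlet, 2, self.alpha)
```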
24-25 PAC-MDP Algorithms
- Explicit Explore-or-Exploit (E3)
- RMAX
Common intuition: optimism in the face of uncertainty (i.e., when faced with the choice between a certain and an uncertain reward region, try the UNCERTAIN reward region).
26 Explicit Explore-or-Exploit (E3) algorithm
27-28 Explicit Explore-or-Exploit (E3) algorithm (Kearns and Singh, 2002)
- Model-based approach with polynomial sample complexity (PAC-MDP)
- uses optimism in the face of uncertainty
- assumes knowledge of the maximum reward
- maintains counts for states and actions to quantify confidence in model estimates
- a state s is known if all actions in s have been executed sufficiently often
From the observed data, E3 constructs two MDPs:
- MDP_known: includes the known states with (approximately exact) estimates of P(s_{t+1} | a_t, s_t) and P(r_t | a_t, s_t) — the model which captures what you know
- MDP_unknown: MDP_known plus a special state where the agent receives maximum reward — the model which drives exploration
29 E3 sketch
Input: state s. Output: action a.
if s is known then  (sufficiently accurate model estimates)
    plan in MDP_known
    if the resulting plan has value above some threshold then
        return first action of the plan  (exploitation)
    else
        plan in MDP_unknown
        return first action of the plan  (planned exploration)
else
    return the action with the least observations in s  (direct exploration)
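A Python rendering of the sketch above, under stated assumptions: plan(mdp, s) is any planner returning (value, first_action) for state s (e.g. value iteration on the given model), and a state counts as known once every action has been tried at least m_known times. All names are illustrative.

```python
import numpy as np

def e3_action(s, counts, mdp_known, mdp_unknown, m_known, value_threshold, plan):
    """One decision step of E3 (sketch).
    counts[s, a] holds the visit count for each state-action pair."""
    if np.all(counts[s] >= m_known):              # s is known
        value, action = plan(mdp_known, s)
        if value > value_threshold:
            return action                         # exploitation
        _, action = plan(mdp_unknown, s)
        return action                             # planned exploration
    return int(np.argmin(counts[s]))              # direct exploration
```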
30 E3 example (figure) — S. Singh (Tutorial 2005)
31 E3 example (figure) — S. Singh (Tutorial 2005)
32 E3 example (figure): M = true known-state MDP, \hat{M} = estimated known-state MDP — S. Singh (Tutorial 2005)
33 E3 Implementation Setting
- T is the time horizon; G^T_max is the maximum T-step return (in the discounted case, G^T_max \le T R_max).
- A state is known if it was visited O\big((N T G^T_max / \epsilon)^4 \, Var_max \, \log(1/\delta)\big) times, where Var_max is the maximum variance of the random payoffs over all states.
- For the exploration/exploitation choice at known states, the optimal value function V^* is assumed to be given: if \hat{V} obtained from MDP_known is greater than V^* - \epsilon, then exploit.
34 RMAX Algorithm
35-36 RMAX
R-MAX solves only a single model (it does not separate MDP_known and MDP_unknown) and therefore implicitly explores or exploits. The R-MAX and E3 algorithms were able to achieve roughly the same level of performance (Strehl's thesis).
RMAX builds an approximate MDP based on the reward function
\hat{R}(s, a) = R(s, a) if (s, a) is known; R_max otherwise.
37 RMAX sketch
Initialize all counters n(s, a) = 0, n(s, a, s') = 0.
Initialize \hat{T}(s' | s, a) = I_{s'=s}, \hat{R}(s, a) = R_max.
while true do
    Compute policy \pi_t using the MDP model (\hat{T}, \hat{R}).
    Choose a = \pi_t(s), observe s', r.
    n(s, a) += 1; r(s, a) += r; n(s, a, s') += 1
    if n(s, a) = m then
        Update \hat{T}(\cdot | s, a) = n(s, a, \cdot) / n(s, a) and \hat{R}(s, a) = r(s, a) / n(s, a).
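A runnable sketch of this loop, reusing the value_iteration helper from the model-based RL sketch earlier; env.reset()/env.step(a) returning (s', r) is an assumed interface, and m, gamma are free parameters rather than values from the slides:

```python
import numpy as np

def rmax(env, n_states, n_actions, m, r_max, gamma=0.95, n_steps=10_000):
    """R-MAX sketch: unknown (s,a) pairs self-loop and yield reward R_max."""
    n = np.zeros((n_states, n_actions))                # n(s, a)
    n_sas = np.zeros((n_states, n_actions, n_states))  # n(s, a, s')
    r_sum = np.zeros((n_states, n_actions))
    # Optimistic initialization: T_hat(s|s,a) = 1 (self-loop), R_hat = R_max
    T_hat = np.zeros((n_states, n_actions, n_states))
    T_hat[np.arange(n_states), :, np.arange(n_states)] = 1.0
    R_hat = np.full((n_states, n_actions), r_max)

    s = env.reset()
    for _ in range(n_steps):
        _, policy = value_iteration(T_hat, R_hat, gamma)  # plan on (T_hat, R_hat)
        a = int(policy[s])
        s_next, r = env.step(a)
        n[s, a] += 1
        r_sum[s, a] += r
        n_sas[s, a, s_next] += 1
        if n[s, a] == m:                               # (s, a) becomes known
            T_hat[s, a] = n_sas[s, a] / n[s, a]
            R_hat[s, a] = r_sum[s, a] / n[s, a]
        s = s_next
    return T_hat, R_hat
```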
38 RMAX Analysis
The general PAC-MDP theorem does not easily adapt to the analysis of E3 because of E3's use of two internal models (the original analysis depends on the horizon and mixing time).
(Upper bound) There exists m = O\big((N T^2 / \epsilon^2) \ln^2(NK/\delta)\big) such that, with probability at least 1 - \delta, V(s_t) \ge V^*(s_t) - \epsilon holds for all but O\big((N^2 K T^3 / \epsilon^3) \ln^2(NK/\delta)\big) timesteps, where N is the number of states. For the discounted case, T = \log(1/\epsilon) / (1 - \gamma).
39 Limitations of E3 and RMAX
E3/RMAX is called efficient because its sample complexity scales only polynomially in the number of states. In natural environments, however, this number of states is enormous: it is exponential in the number of objects in the environment. Hence E3/RMAX scales exponentially in the number of objects.
Generalization over states and actions is crucial for exploration.
40 KWIK
KWIK (Knows What It Knows): a supervised-learning model
41 KWIK
- Input set: X
- Output set: Y
- Observation set: Z
- Hypothesis class: H \subseteq \{X \to Y\}
- Target function: h^* \in H
- Special symbol: ? ("I don't know")
42 KWIK (figure)
43 KWIK Example: Coin Learning
Predict Pr(head) \in [0, 1] for a coin from many observations (head or tail).
Algorithm: predict ? for the first O(1/\epsilon^2 \log(1/\delta)) observations; use the empirical estimate afterwards. The bound O(1/\epsilon^2 \log(1/\delta)) follows from Hoeffding's bound. (Li et al., ICML 2008)
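A sketch of this coin learner, with the sample bound instantiated from Hoeffding's inequality (m \ge \ln(2/\delta) / (2\epsilon^2) samples suffice for an \epsilon-accurate estimate with probability at least 1 - \delta); returning None plays the role of the "?" symbol, and the class name is illustrative:

```python
import math

class KwikCoin:
    """KWIK learner for Pr(head): answer '?' until enough samples."""

    def __init__(self, epsilon, delta):
        # Hoeffding: m >= ln(2/delta) / (2 * epsilon^2) samples suffice
        self.m = math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
        self.heads = 0
        self.n = 0

    def predict(self):
        """Return '?' (None) until m observations, then the empirical rate."""
        return None if self.n < self.m else self.heads / self.n

    def observe(self, outcome):
        """outcome: 1 for head, 0 for tail."""
        self.heads += outcome
        self.n += 1
```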
44 KWIK-RMAX
T(s' | s, a) and R(s, a) are KWIK-learned.
45 Conclusions
- RL agents need to solve the exploration-exploitation tradeoff.
- Sample complexity measures the required number of explorative actions of an algorithm.
- Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain.
46 References
- Kakade (2003): On the Sample Complexity of Reinforcement Learning. PhD thesis.
- Poupart, Vlassis, Hoey, Regan: An Analytic Solution to Discrete Bayesian Reinforcement Learning. ICML 2006.
- Li, Littman, Walsh: Knows What It Knows: A Framework for Self-Aware Learning. ICML 2008.