Admin. MDP Search Trees. Optimal Quantities. Reinforcement Learning

Admin. Reinforcement Learning. Content adapted from Berkeley CS188.

MDP Search Trees: Each MDP state projects an expectimax-like search tree.

Optimal Quantities:
The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally.
The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally.
The optimal policy: π*(s) = optimal action from state s.
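To make these quantities concrete in the sketches that follow, here is a minimal, hypothetical MDP encoding in Python. The states, actions, transition probabilities, rewards, and the names T, R, and GAMMA are illustrative assumptions, not taken from the slides:

```python
# A minimal, hypothetical MDP encoding used by the later sketches.
# T[(s, a)] is a list of (next_state, probability); R[(s, a, s')] is the reward.
GAMMA = 0.9  # discount factor (assumed value for illustration)

STATES = ["cool", "warm", "overheated"]
ACTIONS = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []}

T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}

R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}
```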

Bellman Equations:
Defining optimal utility via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
V*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
These are the Bellman equations, and they characterize optimal behavior in a way we'll use over and over.

Value Iteration:
The Bellman equations characterize the optimal values; value iteration computes them. Value iteration is just a fixed-point solution method, though the V_k vectors are also interpretable as time-limited values.

Value Iteration Convergence:
How do we know the V_k vectors are going to converge?
Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values.
Case 2: If the discount is less than 1. Sketch: for any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results in nearly identical search trees. The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros. That last layer is at best all R_MAX and at worst all R_MIN, but everything is discounted by γ^k that far out. So V_k and V_{k+1} are at most γ^k max|R| different, and as k increases, the values converge.
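A minimal sketch of value iteration against the hypothetical MDP encoded above: q_value implements the one-step lookahead Q(s,a), and the loop applies the Bellman update starting from V_0 = 0.

```python
def q_value(V, s, a):
    """One-step lookahead: Q(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + gamma * V(s') ]."""
    return sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])

def value_iteration(iterations=100):
    """Repeatedly apply the Bellman update V_{k+1}(s) = max_a Q_k(s,a)."""
    V = {s: 0.0 for s in STATES}  # V_0 = 0 everywhere, which we know is right
    for _ in range(iterations):
        V = {s: max((q_value(V, s, a) for a in ACTIONS[s]), default=0.0)
             for s in STATES}
    return V
```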

Value Iteration Convergence, Policy Loss [Demo]

Policy Iteration:
An alternative approach for optimal policies:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
Step 2: Policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) utilities as future values.
This is policy iteration. It's still optimal! It can converge (much) faster under some conditions.

Policy Iteration (EM):
Evaluation: for the fixed current policy π, find values with policy evaluation; iterate until the values converge.
Improvement: for fixed values, get a better policy using policy extraction (one-step lookahead).
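A minimal sketch of policy iteration, reusing q_value, STATES, and ACTIONS from the sketches above: policy evaluation iterates the fixed-policy Bellman update, and policy improvement extracts a better policy with one-step lookahead.

```python
def policy_evaluation(pi, iterations=100):
    """Evaluate a fixed policy: V_{k+1}(s) = sum_s' T(s,pi(s),s')[R + gamma * V_k(s')]."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V = {s: q_value(V, s, pi[s]) if ACTIONS[s] else 0.0 for s in STATES}
    return V

def policy_improvement(V):
    """One-step lookahead policy extraction: pi(s) = argmax_a Q(s,a)."""
    return {s: max(ACTIONS[s], key=lambda a: q_value(V, s, a))
            for s in STATES if ACTIONS[s]}

def policy_iteration():
    """Alternate evaluation and improvement until the policy stops changing."""
    pi = {s: ACTIONS[s][0] for s in STATES if ACTIONS[s]}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi)
        new_pi = policy_improvement(V)
        if new_pi == pi:
            return pi, V
        pi = new_pi
```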

Policy Iteration. Generalized Policy Iteration [Sutton and Barto].

Policy Iteration Proof (Sketch):
Guaranteed to converge: in every step the policy improves (because otherwise we return once we get the same policy twice). Thus every iteration generates a new policy. There are a finite number of policies, so in the worst case we may have to iterate through all (num actions)^(num states) policies before we terminate.
Optimality at convergence: by definition of convergence, π_{k+1}(s) = π_k(s). This means: for all s, V^{π_k}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^{π_k}(s') ]. Thus V^{π_k}(s) satisfies the Bellman equation, which means V^{π_k}(s) is a fixed-point solution to the Bellman equation, i.e. V*(s).

Double Bandits

Double-Bandit MDP: Actions: Blue, Red. States: Win, Lose.

Offline Planning:
Solving MDPs is offline planning. You determine all quantities through computation. You need to know the details of the MDP. You do not actually play the game!

Online Planning:
New rules! Red's win chance is different. Let's play! $0 $0 $0 $2 $0 $2 $0 $0 $0

What Just Happened?
That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation; you needed to actually act to figure it out.

Reinforcement Learning:
Basic idea: receive feedback in the form of rewards. The agent's utility is defined by the reward function. It must (learn to) act so as to maximize expected rewards. All learning is based on observed samples of outcomes!

Example: Learning to Walk: Initial, Training, Finished [Kohl and Stone, ICRA 2004]

Example: Learning to Walk. The Crawler! [Tedrake, Zhang, and Seung 2005] [You, Project 3]

Reinforcement Learning: Offline (MDPs) vs. Online (RL)
We still assume a Markov decision process (MDP): a set of states s ∈ S, a set of actions A (per state), a model T(s,a,s'), and a reward function R(s,a,s'). We are still looking for a policy π(s).
New twist: we don't know T or R! I.e., we don't know which states are good or what the actions do. We must actually try actions and states out to learn.

Quiz 1: Reinforcement Learning
The difference between planning in a known Markov Decision Process and reinforcement learning (RL) is that:
In RL the agent doesn't know the transition model T or the reward function R.
In RL the agent doesn't know what its current state is (e.g., doesn't know its own position when acting in a gridworld).
A) T/T B) T/F C) F/T D) F/F

Model-Based Learning:
Model-based idea: learn an approximate model based on experiences, then solve for values as if the learned model were correct.
Step 1: Learn an empirical MDP model. Count outcomes s' for each (s,a). Normalize to give an estimate of T̂(s,a,s'). Discover each R̂(s,a,s') when we experience (s,a,s').
Step 2: Solve the learned MDP. For example, use value iteration, as before.

Example: Model-Based Learning
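A minimal sketch of Step 1 (learning the empirical MDP model) from observed transitions. The episode format and the names learn_model, T_hat, and rewards are assumptions for illustration:

```python
from collections import Counter, defaultdict

def learn_model(episodes):
    """Estimate T-hat and R-hat from observed (s, a, s', r) transitions.

    `episodes` is a list of lists of (s, a, s_next, r) tuples (assumed format).
    """
    counts = defaultdict(Counter)   # counts[(s, a)][s'] = times s' followed (s, a)
    rewards = {}                    # rewards[(s, a, s')] = observed reward (assumed deterministic)
    for episode in episodes:
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r
    # Normalize counts to get the estimated transition probabilities.
    T_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
             for sa, c in counts.items()}
    return T_hat, rewards
```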

Example: Expected Age
Goal: compute the expected age of CS 151 students. Without P(A), instead collect samples [a_1, a_2, ..., a_N].
Try it! Get in a group of 3-6 students. Based on your sample, build a model for the expected graduation year of students in the class. What's your prediction?
Why does this work? You eventually learn the right model. Why does this work? Samples appear with the right frequencies. (A tiny sketch of this sample average appears after the quiz below.)

Quiz 2: Model-Based Learning (Rapid-Fire Click-in)
What model would be learned from the above observed episodes?
T(A,south,C) = ?  T(B,east,C) = ?  T(C,south,E) = ?  T(C,south,D) = ?
A) 1.0 B) 0.75 C) 0.5 D) 0.25 E) 0.0
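The model-free version of the expected-age example simply averages the collected samples; this works because the samples appear with the right frequencies. A tiny sketch, with illustrative numbers:

```python
def expected_age(samples):
    """Model-free estimate: average the observed samples directly.

    Because sample a_i appears with the right frequency, (1/N) * sum_i a_i
    approximates sum_a P(a) * a without ever estimating P(a).
    """
    return sum(samples) / len(samples)

# e.g. expected_age([19, 20, 20, 21, 22]) -> 20.4  (made-up sample)
```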

Model-Free Learning

Passive Reinforcement Learning:
Simplified task: policy evaluation. Input: a fixed policy π(s). You don't know the transitions T(s,a,s') or the rewards R(s,a,s'). Goal: learn the state values.
In this case: the learner is along for the ride. No choice about what actions to take; just execute the policy and learn from experience. This is NOT offline planning! You actually take actions in the world.

Direct Evaluation:
Goal: compute values for each state under π. Idea: average together observed sample values. Act according to π. Every time you visit a state, write down what the sum of discounted rewards turned out to be, and average those samples. This is called direct evaluation.
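A minimal sketch of direct evaluation under stated assumptions: episodes are lists of (s, a, s', r) transitions gathered while following π, and the discount defaults to an assumed 0.9.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """For every visit to a state, record the discounted return observed from
    that point on, then average the recorded samples per state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the return-to-go.
        for s, a, s2, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```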

Example: Direct Evaluation (observed values from the figure: -10, +8, +4, +10, -2)

Problems with Direct Evaluation:
What's good about direct evaluation? It's easy to understand, it doesn't require any knowledge of T or R, and it eventually computes the correct average values, using just sample transitions.
What's bad about it? It wastes information about state connections; each state must be learned separately, so it takes a long time to learn.

Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy: each round, replace V with a one-step-lookahead layer over V. What gives? This approach fully exploited the connections between the states; unfortunately, we need T and R to do it! Key question: how can we do this update to V without knowing T and R? In other words, how do we take a weighted average without knowing the weights?

Quiz 3: Passive Reinforcement Learning
Estimate the output values of the following:

Quiz 4: Rapid-Fire Click-in
V^π(A) = ?  V^π(B) = ?  V^π(C) = ?  V^π(D) = ?  V^π(E) = ?
A) 10 B) 8 C) 4 D) -2 E) -10

Sample-Based Policy Evaluation?
We want to improve our estimate of V by computing these averages. Idea: take samples of outcomes s' (by doing the action!) and average.

Temporal Difference Learning:
Big idea: learn from every experience! Update V(s) each time we experience a transition (s,a,s',r). Likely outcomes s' will contribute updates more often. This is temporal difference learning of values: the policy is still fixed, and we're still doing evaluation! Move values toward the value of whatever successor occurs: a running average.
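A minimal sketch of the TD(0) update just described: move V(s) toward the observed sample r + γ·V(s') with a running (exponential moving) average controlled by learning rate α. The dictionary representation and default parameter values are assumptions:

```python
def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """Temporal-difference value update for a fixed policy, one transition at a time."""
    sample = r + gamma * V.get(s_next, 0.0)          # new sample estimate of V(s)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample  # running average toward it
    return V
```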

TD Demo!

Exponential Moving Average:
The running interpolation update makes recent samples more important and forgets about the past (distant past values were wrong anyway). A decreasing learning rate (alpha) can give converging averages.

Quiz 4: TD Learning

Problems with TD Value Learning:
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages. However, if we want to turn values into a (new) policy, we're sunk. Idea: learn Q-values, not values; this makes action selection model-free too!

Active Reinforcement Learning:
Full reinforcement learning: optimal policies (like value iteration). You don't know the transitions T(s,a,s') or the rewards R(s,a,s'), and you choose the actions now. Goal: learn the optimal policy / values.
In this case: the learner makes choices! Fundamental tradeoff: exploration vs. exploitation. This is NOT offline planning! You actually take actions in the world and find out what happens.

Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values. Start with V_0(s) = 0, which we know is right. Given V_k, calculate the depth k+1 values for all states. But Q-values are more useful, so compute them instead! Start with Q_0(s,a) = 0, which we know is right. Given Q_k, calculate the depth k+1 q-values for all q-states:
Q_{k+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]

Q-Learning:
Q-learning: sample-based Q-value iteration. Learn Q(s,a) values as you go. Receive a sample (s,a,s',r). Consider your old estimate Q(s,a) and your new sample estimate r + γ max_{a'} Q(s',a'). Incorporate the new estimate into a running average:
Q(s,a) ← (1 - α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
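A minimal sketch of the Q-learning update on a single sample; the actions helper and the tabular dictionary for Q are assumptions for illustration.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, s_next, r, actions, alpha=0.5, gamma=0.9):
    """Sample-based Q-value iteration on one observed (s, a, s', r).

    `Q` maps (state, action) -> value; `actions(s')` lists the legal actions in
    s' (hypothetical helper). Off-policy: the max over a' is used regardless of
    which action the agent actually takes next.
    """
    next_actions = actions(s_next)
    sample = r + gamma * (max(Q[(s_next, a2)] for a2 in next_actions)
                          if next_actions else 0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q

# Usage (illustrative): Q = defaultdict(float)
# Q = q_learning_update(Q, "A", "east", "B", -1.0, lambda s: ["east", "west"])
```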

Q-Learning Demo: Gridworld. Q-Learning Demo: Crawler.

Q-Learning Properties:
Amazing result: Q-learning converges to the optimal policy even if you're acting suboptimally! This is called off-policy learning.
Caveats: you have to explore enough, and you have to eventually make the learning rate small enough (but not decrease it too quickly). Basically, in the limit, it doesn't matter how you select actions(!).

Quiz 5: Q-Learning
Which of the following equations is the Q-value iteration update?
T/F: If α = 1, no averaging will happen; instead, simply the value from the sample will be used.
T/F: If α = 0, then the sample will not influence the update.

Next time
Amazing result: Q-learning converges to the optimal policy even if you're acting suboptimally! But how do we select actions? And what if our state/action space is too large to maintain?