Reinforcement Learning" CMPSCI 383 Nov 29, 2011!

Similar documents
Admin. MDP Search Trees. Optimal Quantities. Reinforcement Learning

An Introduction to COMPUTATIONAL REINFORCEMENT LEARING. Andrew G. Barto. Department of Computer Science University of Massachusetts Amherst

You need to be able to define the following terms and answer basic questions about them:

Administrativia. Assignment 1 due thursday 9/23/2004 BEFORE midnight. Midterm exam 10/07/2003 in class. CS 460, Sessions 8-9 1

Department of Electrical Engineering, University of Waterloo. Introduction

Part 3 Introduction to statistical classification techniques

Pattern Recognition 2014 Support Vector Machines

Assessment Primer: Writing Instructional Objectives

Differentiation Applications 1: Related Rates

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank

IAML: Support Vector Machines

Math Foundations 20 Work Plan

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

Simple Linear Regression (single variable)

COMP9414/ 9814/ 3411: Artificial Intelligence. 14. Course Review. COMP3411 c UNSW, 2014

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India

Math 105: Review for Exam I - Solutions

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came.

Kinetic Model Completeness

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Dataflow Analysis and Abstract Interpretation

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Sequential Allocation with Minimal Switching

Getting Involved O. Responsibilities of a Member. People Are Depending On You. Participation Is Important. Think It Through

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels

Trigonometric Ratios Unit 5 Tentative TEST date

Chapter 3: Cluster Analysis

Lecture 13: Markov Chain Monte Carlo. Gibbs sampling

Part a: Writing the nodal equations and solving for v o gives the magnitude and phase response: tan ( 0.25 )

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

CS:4420 Artificial Intelligence

Elements of Machine Intelligence - I

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving.

Five Whys How To Do It Better

Lab 1 The Scientific Method

COMP 551 Applied Machine Learning Lecture 4: Linear classification

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank

Activity Guide Loops and Random Numbers

x x

NUMBERS, MATHEMATICS AND EQUATIONS

COMP9444 Neural Networks and Deep Learning 3. Backpropagation

Physics 2010 Motion with Constant Acceleration Experiment 1

Materials Engineering 272-C Fall 2001, Lecture 7 & 8 Fundamentals of Diffusion

Support-Vector Machines

Computational modeling techniques

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Medium Scale Integrated (MSI) devices [Sections 2.9 and 2.10]

Hypothesis Tests for One Population Mean

A Scalable Recurrent Neural Network Framework for Model-free

The Law of Total Probability, Bayes Rule, and Random Variables (Oh My!)

Thermodynamics Partial Outline of Topics

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA

Turing Machines. Human-aware Robotics. 2017/10/17 & 19 Chapter 3.2 & 3.3 in Sipser Ø Announcement:

INSTRUMENTAL VARIABLES

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank

Resampling Methods. Chapter 5. Chapter 5 1 / 52

We can see from the graph above that the intersection is, i.e., [ ).

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

CONSTRUCTING STATECHART DIAGRAMS

Lab #3: Pendulum Period and Proportionalities

Thermodynamics and Equilibrium

Petrel TIPS&TRICKS from SCM

Bootstrap Method > # Purpose: understand how bootstrap method works > obs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(obs) >

What is Statistical Learning?

Least Squares Optimal Filtering with Multirate Observations

, which yields. where z1. and z2

Plan o o. I(t) Divide problem into sub-problems Modify schematic and coordinate system (if needed) Write general equations

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d)

A Few Basic Facts About Isothermal Mass Transfer in a Binary Mixture

Ecology 302 Lecture III. Exponential Growth (Gotelli, Chapter 1; Ricklefs, Chapter 11, pp )

Evaluating enterprise support: state of the art and future challenges. Dirk Czarnitzki KU Leuven, Belgium, and ZEW Mannheim, Germany

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y )

Public Key Cryptography. Tim van der Horst & Kent Seamons

Optimization Programming Problems For Control And Management Of Bacterial Disease With Two Stage Growth/Spread Among Plants

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

The standards are taught in the following sequence.

SUMMER REV: Half-Life DUE DATE: JULY 2 nd

WYSE Academic Challenge Regional Mathematics 2007 Solution Set

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw:

Maximum A Posteriori (MAP) CS 109 Lecture 22 May 16th, 2016

MODULE 1. e x + c. [You can t separate a demominator, but you can divide a single denominator into each numerator term] a + b a(a + b)+1 = a + b

ENSC Discrete Time Systems. Project Outline. Semester

GMM with Latent Variables

Dead-beat controller design

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A.

INSTRUCTIONAL PLAN Day 2

Lecture 5: Equilibrium and Oscillations

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Lesson Plan. Recode: They will do a graphic organizer to sequence the steps of scientific method.

Tree Structured Classifier

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Query Expansion. Lecture Objectives. Text Technologies for Data Science INFR Learn about Query Expansion. Implement: 10/24/2017

Hiding in plain sight

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

ANSWER KEY FOR MATH 10 SAMPLE EXAMINATION. Instructions: If asked to label the axes please use real world (contextual) labels

Purchase Order Workflow Processing

Accelerated Chemistry POGIL: Half-life

Transcription:

Reinfrcement Learning" CMPSCI 383 Nv 29, 2011! 1

Todayʼs lecture" Review of Chapter 17: Making Complex Decisions! Sequential decision problems! The motivation and advantages of reinforcement learning.! Passive learning! Policy evaluation! Direct utility estimation! Adaptive dynamic programming! Temporal Difference (TD) learning! 2

Check this ut" MIT's Technlgy Review has an in-depth interview with Peter Nrvig, Ggle'sDirectr f Research, and Eric Hrvitz, a Distinguished Scientist at MicrsftResearch abut their ptimism fr the futuref AI.! http://www.technlgyreview.cm/cmputing/39156/! 3

A Simple Eample" Gridwrld with 2 Gal states! Actins: Up, Dwn, Left, Right! Fully bservable: Agent knws where it is! 4

Transitin Mdel" 5

Markv Assumptin" 6

Agentʼs Utility Function" Performance depends on the entire sequence of states and actions.! Environment history! In each state, the agent receives a reward R(s).! The reward is real-valued. It may be positive or negative.! Utility of environment history = sum of the rewards received.! 7

Reward functin" -.04 -.04 -.04 -.04 -.04 -.04 -.04 -.04 8

Decisin Rules" Decisin rules say what t d in each state.! Often called plicies, π. Actin fr state s is given by π(s).! 9

Our Gal" Find the plicy that maimizes the epected sum f rewards.! Called an ptimal plicy.! 10

Markov Decision Process (MDP)" M=(S,A,P,R)! S = set of possible states! A = set of possible actions! P(sʼ | s,a) gives transition probabilities! R = reward function! Goal: find an optimal policy, π*.! 11
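
As a concrete illustration of the tuple M = (S, A, P, R), here is a minimal Python sketch of an MDP container; the class name and the dictionary-based representation are my own assumptions for illustration, not code from the lecture.

```python
# Minimal sketch of an MDP as the tuple M = (S, A, P, R).
# The class name and representation are illustrative assumptions,
# not from the lecture.
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S: set of possible states
    actions: list       # A: set of possible actions
    transitions: dict   # transitions[(s, a)] maps s' -> P(s' | s, a)
    rewards: dict       # rewards[s] = R(s)
    gamma: float = 1.0  # discount factor (1.0 = additive rewards)

    def P(self, s_next, s, a):
        """Transition probability P(s' | s, a)."""
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s):
        """Reward received in state s."""
        return self.rewards[s]
```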

Finite/Infinite Hrizn" Finite hrizn: the game ends after N steps.! Infinite hrizn: the game never ends! With a finite hrizn, the ptimal actin in a given state culd change ver time.! The ptimal plicy is nnstatinary.! With infinite hrizn, the ptimal plicy is statinary.! 12

Utilities ver time" Additive rewards:! Discunted rewards:! Discunt factr:! 13

Discunted Rewards" Wuld yu rather have a marshmallw nw, r tw in 20 minutes?! Infinite sums!! 14

Utility f States" Given a plicy, we can define the utility f a state:! 15

Plicy Evaluatin" Finding the utility f states fr a given plicy.! Slve a system f linear equatins:! π π An instance f a Bellman Equatin.! 16

Plicy Evaluatin" The n linear equatins can be slved in O(n 3 ) with standard linear algebra methds.! If O(n 3 ) is still t much, we can d it iteratively:! π π 17

Optimal Plicy" Optimal plicy desnʼt depend n what state yu start in (fr infinite hrizn discunted case).! Optimal plicy:! π * True utility f a state:! ptimal value functin! 18

Selecting optimal actions" Given the true U(s) values, how can we select actions? (Maximum expected utility, MEU)! 19
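
The standard MEU answer from Chapter 17 (stated here because it is not visible in the transcription): in each state, pick the action whose outcome has the highest expected utility,

π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s')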

Utility and Rewards" Utility = lng term ttal reward frm s nwards! Reward = shrt term reward frm s! 20

Utility" 21

Value-Guided Optimal Control" [Figure: learned value function for a hill-climbing control task; predicted minimum time to goal (negated); get to the top of the hill as quickly as possible (roughly)] Munos & Moore, Variable resolution discretization for high-accuracy solutions of optimal control problems, IJCAI 99. 22

Searching for Optimal Policies" Bellman Equation! If we write out the Bellman equation for all n states, we get n equations with n unknowns: U(s).! We can solve this system of equations to determine the utility of every state.! 23
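
The Bellman equation itself is missing from the transcription; in the notation used elsewhere in these slides it reads

U(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U(s')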

Value Iteratin" The equatins are nn-linear, s we canʼt use standard linear algebra methds.! Value iteratin: start with randm initial values fr each U(s), iteratively update each value t fit the fight-hand side f the equatin:! 24

Value Iteratin" The update is applied simultaneusly t every state.! If this update is applied infinitely ften, we are guaranteed t find the true U(s) values.! There is ne unique slutin! Given the true U(s) values, hw can we select actins? (Maimum epected utility MEU)! 25

Plicy Iteratin" Plicy iteratin interleaves tw steps:! Plicy evaluatin: Given a plicy, cmpute the utility f each state fr that plicy! Plicy imprvement: Calculate a new MEU plicy! Terminate when the plicy desnʼt change the utilities.! Guaranteed t cnverge t an ptimal plicy! 26

Asynchronous Policy Iteration" We said the utility of every state is updated simultaneously. This isnʼt necessary.! You can pick any subset of the states and apply either policy improvement or value iteration to that subset.! Given certain conditions, this is also guaranteed to converge to an optimal policy.! 27

Reinfrcement Learning" 28

Edward L. Thorndike (1874-1949)" Learning by Trial-and-Error! puzzle box! 29

Law of Effect" Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that when it recurs, they will be less likely to occur. Edward Thorndike, 1911 30

RL = Search + Memry" Search: Trial-and-Errr, Generate-and-Test, Variatin-and-Selectin! Memry: remember what wrked best fr each situatin and start frm there net time! 31

MENACE (Michie, 1961)" Matchbox Educable Noughts and Crosses Engine 32

Supervised learning" Training Inf = desired (target) utputs! Inputs! Supervised Learning " System! Outputs! Errr = (target utput actual utput)! 33

Reinfrcement learning" Training Inf = evaluatins ( rewards, penalties, etc.)! RL" Inputs! Outputs! System! ( actins, cntrls )! Objective: get as much reward as pssible! 34

Reinfrcement learning" Learning what t d hw t map situatins t actins s as t maimize a numerical reward signal! Rewards may be prvided fllwing each actin, r nly when the agent reaches a gal state.! The credit assignment prblem.! 35

Why reinfrcement learning?" Mre realistic mdel f the kind f learning that animals d.! Supervised learning is smetimes unrealistic: where will crrect eamples cme frm?! Envirnments change, and s the agent must adjust its actin chices.! 36

Stick to this kind of RL problem here:" Environment is fully observable: agent knows what state the environment is in! But: agent does not know how the environment works or what its actions do! 37

Sme kinds f RL agents" Utility-based agent: learns utility functin n states and uses it t select actins! Needs an envirnment mdel t decide n actins! Q-learning agent: leans and actin-utility functins, r Q-functin, giving epected utility fr taking each actin in each state.! Des nt need an envirnment mdel.! Refle agent: learn a plicy withut first learning a state-utility functin r a Q-functin! 38

Passive versus active learning" A passive learner simply watches the world going by and tries to learn the utility of being in various states.! An active learner must also act using the learned information, and can use its problem generator to suggest explorations of unknown portions of the environment.! 39

Passive learning" Given (but agent desnʼt knw this):" A Markv mdel f the envirnment.! States, with prbabilistic actins.! Terminal states have rewards/utilities.! Prblem:" Learn epected utility f each state.! Nte: if agent knws hw the envirnment and its actins wrk, can slve the relevant Bellman equatin (which wuld be linear).! 40

Eample" -.04 -.04 -.04 -.04 -.04 -.04 -.04 -.04 Percepts tell agent: [State, Reward, Terminal?] 41

Learning utility functins" A training sequence (r episde) is an instance f wrld transitins frm an initial state t a terminal state.! The additive utility assumptin: utility f a sequence is the sum f the rewards ver the states f the sequence.! Under this assumptin, the utility f a state is the epected reward-t-g f that state.! 42

Direct Utility Estimatin" Fr each training sequence, cmpute the reward-t-g fr each state in the sequence and update the utilities.! This is just learning the utility functin frm eamples.! Generates utility estimates that minimize the mean square errr (LMS-update).! 43

Direct Utility Estimatin" U(i) (1 α)u(i) + α REWARD(training sequence) state i! T T T T T T T T T T 44

Problems with direct utility estimation" Converges slowly because it ignores the relationship between neighboring states:! [Figure: a state with New U = ? adjacent to a state with Old U = .8, transition probabilities p = .9 and p = .1, and a +1 terminal] 45

Key Observation: Temporal Consistency" Utilities of states are not independent! The utility of each state equals its own reward plus the expected utility of its successor states:! U^π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^π(s')! The key fact:! U_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + ⋯ = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + γ^2 r_{t+4} + ⋯) = r_{t+1} + γ U_{t+1}! 46

Adaptive dynamic prgramming" Learn a mdel: transitin prbabilities, reward functin! D plicy evaluatin! Slve the Bellman equatin either directly r iteratively (value iteratin withut the ma)! Learn mdel while ding iterative plicy evaluatin:! Update the mdel f the envirnment after each step. Since the mdel changes nly slightly after each step, plicy evaluatin will cnverge quickly.! 47

Adaptive dynamic prg (ADP)" j U(i) R(i) + P ij U( j) state i R(i) pssible states j T T T T T T T T T T 48
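
A Python sketch of a passive ADP learner that maintains transition counts, learns the rewards, and re-runs iterative policy evaluation after each observed step. The class and method names are illustrative assumptions, and terminal-state handling is omitted.

```python
# Sketch of a passive ADP learner (illustrative assumptions, not the lecture's code).
from collections import defaultdict

class PassiveADP:
    def __init__(self, gamma=1.0, sweeps=50):
        self.gamma, self.sweeps = gamma, sweeps
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[s][s'] under the fixed policy
        self.R = {}                                          # learned reward function
        self.U = defaultdict(float)                          # utility estimates

    def observe(self, s, reward, s_next):
        """Update the learned model after one observed step, then re-evaluate."""
        self.R[s] = reward
        self.counts[s][s_next] += 1
        self._policy_evaluation()

    def _policy_evaluation(self):
        # Iterative policy evaluation with the current learned model:
        #   U(i) <- R(i) + gamma * sum_j P_ij U(j)
        for _ in range(self.sweeps):
            for s in self.R:
                total = sum(self.counts[s].values())
                expected = sum(c / total * self.U[j] for j, c in self.counts[s].items())
                self.U[s] = self.R[s] + self.gamma * expected
```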

Passive ADP learning curves" 49

Exact utility values for the example" 50

Temporal difference (TD) learning" Approximate the constraint equations without solving them for all states.! Modify U(i) whenever we see a transition from i to j using the following rule:! U(i) ← U(i) + α [R(i) + U(j) − U(i)]! The modification moves U(i) closer to satisfying the original equation.! Q. Why does it work?! 51
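
The same rule as a Python sketch, in the undiscounted form shown on the slide (a γ factor on U(j) would be added for the discounted case); the function names and episode format are assumptions for illustration.

```python
# Sketch of the passive TD(0) update from the slide (illustrative only).
from collections import defaultdict

def td_update(U, i, reward_i, j, alpha=0.1):
    """Apply U(i) <- U(i) + alpha * [R(i) + U(j) - U(i)] for an observed transition i -> j."""
    U[i] += alpha * (reward_i + U[j] - U[i])

def td_episode(U, episode, alpha=0.1):
    """Run the TD update along one episode of (state, reward) pairs."""
    for (i, r_i), (j, _) in zip(episode, episode[1:]):
        td_update(U, i, r_i, j, alpha)

U = defaultdict(float)   # utility estimates, default 0
```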

TD learning cntd." U(i) (1 α)u(i) + α[ R(i) + U( j) ] state i R(i) state j T T T T T T T T T T 52

TD learning curves" 53

Net Class" Active Reinfrcement Learning! Sec. 21.3! 54

Net Class" Reinfrcement Learning! Secs. 21.1 21.3! 55