Reinforcement learning


Lecture 3: Reinforcement learning
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Reinforcement learning

We want to learn the control policy $\pi : X \to A$. We see examples of inputs $x$, but the outputs $a$ are not given. Instead of $a$ we get feedback (a reinforcement, or reward) from a critic, quantifying how good the selected output was.

[Diagram: Input $x$ goes to the Learner, which emits Output $a$; a Critic observes the output and sends a Reinforcement back to the Learner.]

The reinforcements may not be deterministic. Goal: find $\pi : X \to A$ with the best expected reinforcements.
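As a minimal sketch of this interaction loop (not from the lecture: the critic below is a made-up stochastic placeholder, and all names are illustrative):

```python
import random

# Sketch of the RL interaction loop: the learner sees an input x, emits an
# action a, and observes only a (possibly stochastic) reinforcement r from
# the critic; the "correct" output is never revealed.

def critic(x, a):
    """Illustrative stochastic reinforcement in {1, -1}."""
    return 1 if random.random() < 0.5 else -1

def run(policy, inputs, steps=100):
    total = 0
    for _ in range(steps):
        x = random.choice(inputs)  # environment presents an input
        a = policy(x)              # learner selects an output (action)
        total += critic(x, a)      # critic returns a reinforcement
    return total

print(run(lambda x: "head", ["Coin1", "Coin2", "Coin3"]))
```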

Gambling example

Game: three different biased coins are tossed. The coin to be tossed is selected randomly from the three options, and I always see which coin I am going to play next. I make bets on head or tail, and I always wager $1. If I win I get $1, otherwise I lose my bet.

RL model:
- Input: X, a coin chosen for the next toss
- Action: A, a choice of head or tail
- Reinforcements: {1, -1}

A policy $\pi : X \to A$, for example:
$\pi$: Coin1 → head, Coin2 → tail, Coin3 → head

Learning goal: find the policy $\pi^*$ (Coin1 → ?, Coin2 → ?, Coin3 → ?) maximizing the future expected profits

$E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right)$

where $\gamma$ is a discount factor (the present value of money).
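A simulation makes the setup concrete. Below is a small sketch of the game; the head probabilities in HEAD_PROB are made-up assumptions (the lecture only says the coins are biased), and the policy is the example one above:

```python
import random

# Sketch of the coin-betting game. The head probabilities are illustrative
# assumptions, not from the lecture.
HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.3, "Coin3": 0.6}

policy = {"Coin1": "head", "Coin2": "tail", "Coin3": "head"}  # example policy

def play_round(policy):
    coin = random.choice(list(HEAD_PROB))        # coin picked at random; we see it
    outcome = "head" if random.random() < HEAD_PROB[coin] else "tail"
    return 1 if policy[coin] == outcome else -1  # win $1 or lose the $1 wager

def average_profit(policy, rounds=100_000):
    return sum(play_round(policy) for _ in range(rounds)) / rounds

print(average_profit(policy))  # roughly 0.33 per round for these made-up biases
```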

Agent navigation example

Agent navigation in a maze: four moves in compass directions. Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability. Objective: reach the goal state in the shortest expected time.

[Figure: maze grid showing agent moves and the goal state G.]

The RL model:
- Input: X, the position of the agent
- Output: A, a move
- Reinforcements: R, -1 for each move and +100 for reaching the goal

A policy $\pi : X \to A$, for example:
$\pi$: Position 1 → right, Position 2 → right, ..., Position 20 → left

Goal: find the policy maximizing future expected rewards $E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right)$.
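A toy version of this setup, sketched below with a 1-D corridor standing in for the maze (positions, success probability, and episode length are illustrative assumptions, not from the lecture):

```python
import random

# Corridor stand-in for the maze: positions 0..4, goal at 4. Moves succeed
# with probability 0.8, otherwise the agent stays put. Rewards follow the
# slide: -1 per move, +100 on reaching the goal.

def step(position, move):
    """Apply a move ('left'/'right'); return (new_position, reward)."""
    if random.random() < 0.8:
        position = max(0, min(4, position + (1 if move == "right" else -1)))
    return position, (100 if position == 4 else -1)

def run_episode(policy, start=0, max_moves=50):
    position, total = start, 0
    for _ in range(max_moves):
        position, r = step(position, policy[position])
        total += r
        if position == 4:   # goal reached; the big reward arrives only here
            break
    return total

policy = {p: "right" for p in range(5)}  # always head toward the goal
print(run_episode(policy))
```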

Objectives of RL learning

Objective: find a mapping $\pi^* : X \to A$ that maximizes some combination of future reinforcements (rewards) received over time.

Valuation models (quantifying how good the mapping is; see the numeric sketch after this slide):
- Finite horizon model: $E\left( \sum_{t=0}^{T} r_t \right)$, time horizon $T \ge 0$
- Infinite horizon discounted model: $E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right)$, discount factor $0 \le \gamma < 1$
- Average reward: $\lim_{T \to \infty} \frac{1}{T} E\left( \sum_{t=0}^{T} r_t \right)$

Exploration vs. Exploitation

The learner actively interacts with the environment. At the beginning the learner does not know anything about the environment; it gradually gains experience and learns how to react to the environment.

Dilemma (exploration vs. exploitation): after some number of steps, should I select the best current choice (exploitation) or try to learn more about the environment (exploration)? Exploitation may involve the selection of a sub-optimal action and prevent the learning of the optimal choice. Exploration may spend too much time trying bad, currently suboptimal actions.
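As referenced above, a small numeric sketch of the three valuation models applied to one observed reward sequence (the rewards and discount factor are made up for illustration):

```python
# The three valuation models on a single finite reward sequence r_0 .. r_T.

rewards = [1, -1, 1, 1, -1, 1]   # made-up rewards, T = 5
gamma = 0.9                      # discount factor, 0 <= gamma < 1

finite_horizon = sum(rewards)                                  # sum_{t=0}^{T} r_t
discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # sum_t gamma^t r_t
average_reward = sum(rewards) / len(rewards)                   # (1/T) sum_t r_t

print(finite_horizon, discounted, average_reward)
```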

Effects of actions on the environment

Effect of actions on the environment (the next input $x$ to be seen):
- No effect: the distribution over possible $x$ is fixed, and action consequences (rewards) are seen immediately.
- Otherwise: the distribution of $x$ can change, and the rewards related to the action can be seen with some delay.

This leads to two forms of reinforcement learning:
- Learning with immediate rewards: the gambling example.
- Learning with delayed rewards: the agent navigation example, where move choices affect the state of the environment (the position changes) and the big reward at the goal state is delayed.

RL with immediate rewards

Recall the game: three different biased coins are tossed; the coin to be tossed is selected randomly from the three options, and I always see which coin I am going to play next; I bet on head or tail, always wagering $1; if I win I get $1, otherwise I lose my bet. The RL model is as before: input X (a coin chosen for the next toss), action A (a head or tail bet), reinforcements {1, -1}.

Learning goal: find $\pi : X \to A$ maximizing the future expected profits over time, $E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right)$, with $\gamma$ a discount factor (the present value of money).

RL with immediate rewards: expected reward

Immediate reward case: the reward for the choice becomes available immediately, and our choice does not affect the environment and thus future rewards. The expected return then decomposes step by step:

$E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) = E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \ldots$

with rewards $r_t$ for every step $t = 0, 1, 2, \ldots$. The expected one-step reward for input $x$ and choice $a$ is $R(x, a)$.

For the gambling problem it can be defined as

$R(x_i, a) = \sum_j r(a, j) \, P(j \mid x_i, a)$

where $j$ is a hidden outcome of the coin toss (recall the definition of the expected loss). The expected one-step reward for a strategy $\pi : X \to A$ is

$R(\pi) = \sum_x R(x, \pi(x)) \, P(x)$

which is the expected reward $E(r_i)$ at every step $i = 0, 1, \ldots$
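To ground the definition, a small sketch computing $R(x, a)$ for the coin example, with the same made-up biases as before:

```python
# One-step expected reward R(x, a) = sum_j r(a, j) P(j | x, a) for the coin
# example. The hidden outcome j is the coin face; head probabilities are
# illustrative assumptions.

HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.3, "Coin3": 0.6}

def expected_reward(x, a):
    """R(x, a): +1 with the probability of guessing the hidden outcome, -1 otherwise."""
    p_head = HEAD_PROB[x]
    p_win = p_head if a == "head" else 1.0 - p_head
    return 1 * p_win + (-1) * (1 - p_win)

print(expected_reward("Coin1", "head"))   # 0.7 * 1 + 0.3 * (-1) = 0.4
```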

RL with immediate rewards: optimizing the expected reward

$\max_\pi E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) = \max_\pi \left[ E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \ldots \right] = \max_\pi R(\pi)\,(1 + \gamma + \gamma^2 + \ldots) = \frac{1}{1-\gamma} \max_\pi R(\pi)$

and the maximization decomposes over inputs:

$\max_\pi R(\pi) = \max_\pi \sum_x R(x, \pi(x))\, P(x) = \sum_x P(x) \left[ \max_{\pi(x)} R(x, \pi(x)) \right]$

Optimal strategy: $\pi^* : X \to A$ with $\pi^*(x) = \arg\max_a R(x, a)$.

Problem: in the RL framework we do not know $R(x, a)$, the expected reward for performing action $a$ at input $x$. How do we get $R(x, a)$?
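With $R(x, a)$ known, the argmax can be taken per input, as in this sketch (same illustrative biases as before):

```python
# The optimal strategy picks, for each input x independently, the action with
# the highest expected one-step reward. Probabilities are illustrative.

HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.3, "Coin3": 0.6}

def R(x, a):
    """Expected one-step reward: +1 with the win probability, -1 otherwise."""
    p_win = HEAD_PROB[x] if a == "head" else 1 - HEAD_PROB[x]
    return 2 * p_win - 1

pi_star = {x: max(["head", "tail"], key=lambda a: R(x, a)) for x in HEAD_PROB}
print(pi_star)  # {'Coin1': 'head', 'Coin2': 'tail', 'Coin3': 'head'}
```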

RL with immediate rewards: estimating $R(x, a)$

Problem: in the RL framework we do not know $R(x, a)$, the expected reward for performing action $a$ at input $x$.

Solution: for each input $x$, try different actions $a$ and estimate $R(x, a)$ using the average of the observed rewards:

$\tilde{R}(x, a) = \frac{1}{N_{x,a}} \sum_{i=1}^{N_{x,a}} r_i$

Action choice: $\tilde{\pi}(x) = \arg\max_a \tilde{R}(x, a)$.

Accuracy of the estimate follows from statistics (Hoeffding's bound):

$P\left( \left| \tilde{R}(x,a) - R(x,a) \right| \ge \epsilon \right) \le \exp\left( \frac{-2 N_{x,a} \epsilon^2}{(r_{\max} - r_{\min})^2} \right)$

so the number of samples needed to get within $\epsilon$ with probability at least $1 - \delta$ is

$N_{x,a} \ge \frac{(r_{\max} - r_{\min})^2}{2 \epsilon^2} \ln \frac{1}{\delta}$

On-line (stochastic approximation): an alternative way to estimate $R(x, a)$. Idea: choose action $a$ for input $x$, observe a reward $r$, and update the estimate

$\tilde{R}(x, a) \leftarrow (1 - \alpha)\, \tilde{R}(x, a) + \alpha\, r$

where $\alpha$ is a learning rate. Convergence property: the approximation converges in the limit for an appropriate learning-rate schedule. Assume $\alpha(n(x,a))$ is the learning rate for the $n$-th trial of the $(x, a)$ pair. Then convergence is assured if, for all $x, a$:

1. $\sum_{i=1}^{\infty} \alpha(i) = \infty$
2. $\sum_{i=1}^{\infty} \alpha(i)^2 < \infty$
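Both estimators can be combined in one sketch: with the schedule $\alpha = 1/N_{x,a}$ (which satisfies both convergence conditions), the on-line update reproduces the sample average exactly. Reward draws and biases below are illustrative:

```python
import random
from collections import defaultdict

# Estimating R(x, a) by trying actions and averaging rewards. The on-line
# update R~ <- (1 - alpha) R~ + alpha r with alpha = 1/N_{x,a} equals the
# running sample average, and 1/n satisfies sum(alpha) = inf, sum(alpha^2) < inf.

HEAD_PROB = {"Coin1": 0.7, "Coin2": 0.3, "Coin3": 0.6}  # illustrative biases

def draw_reward(x, a):
    p_win = HEAD_PROB[x] if a == "head" else 1 - HEAD_PROB[x]
    return 1 if random.random() < p_win else -1

R_hat = defaultdict(float)   # current estimates R~(x, a)
counts = defaultdict(int)    # N_{x,a}: number of trials of each (x, a) pair

for _ in range(50_000):
    x = random.choice(list(HEAD_PROB))
    a = random.choice(["head", "tail"])            # try different actions
    r = draw_reward(x, a)
    counts[(x, a)] += 1
    alpha = 1 / counts[(x, a)]                     # learning-rate schedule
    R_hat[(x, a)] += alpha * (r - R_hat[(x, a)])   # on-line update

print(R_hat[("Coin1", "head")])                    # converges toward 0.4
```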

Exploration vs. Exploitation

In the RL framework the learner actively interacts with the environment. At any point in time it has an estimate $\tilde{R}(x, a)$ for every input-action pair.

Dilemma: should the learner use the current best choice of action (exploitation),

$\hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a)$

or choose another action $a$ and further improve its estimate (exploration)? Different exploration/exploitation strategies exist.

Uniform exploration: choose the current best action $\hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a)$ with probability $1 - \epsilon$; all other choices are selected with uniform probability $\frac{\epsilon}{|A| - 1}$.

Boltzmann exploration: the action is chosen randomly but proportionally to its current expected reward estimate,

$p(a \mid x) = \frac{\exp\left( \tilde{R}(x, a) / T \right)}{\sum_{a' \in A} \exp\left( \tilde{R}(x, a') / T \right)}$

where $T$ is a temperature parameter. What does it do?
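Both strategies in a short sketch; the estimates in R_hat are placeholder values for one fixed input $x$, and eps and T are illustrative settings:

```python
import math
import random

# Two exploration strategies over the current estimates R~(x, .).
R_hat = {"head": 0.4, "tail": -0.4}   # placeholder estimates for one input x

def uniform_exploration(R_x, eps=0.1):
    """Best action with probability 1 - eps; other actions uniformly otherwise."""
    best = max(R_x, key=R_x.get)
    if random.random() < 1 - eps:
        return best
    return random.choice([a for a in R_x if a != best])

def boltzmann_exploration(R_x, T=1.0):
    """p(a|x) proportional to exp(R~(x,a)/T). The temperature T controls the
    trade-off: T near 0 approaches the greedy choice, large T approaches a
    uniform random choice."""
    actions = list(R_x)
    weights = [math.exp(R_x[a] / T) for a in actions]
    return random.choices(actions, weights=weights)[0]

print(uniform_exploration(R_hat), boltzmann_exploration(R_hat))
```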