Reinforcement learning


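CS 2750 Machine Learning
Lecture b: Reinforcement learning
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Reinforcement learning

We want to learn a control policy \(\pi: X \to A\).
- We see examples of inputs \(x\), but the outputs \(a\) are not given.
- Instead of \(a\) we get feedback (a reinforcement, or reward) from a critic quantifying how good the selected output was.

[Figure: an input enters the Learner, which produces an output; a Critic returns a reinforcement to the Learner.]

- The reinforcements may not be deterministic.
- Goal: find \(\pi: X \to A\) with the best expected reinforcements.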

Gambling example

Game: 3 biased coins.
- The coin to be tossed is selected randomly from the three coin options.
- The agent always sees which coin is going to be played next.
- The agent makes a bet on either a head or a tail with a wager of $1.
- If after the coin toss the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.

RL model:
- Input: \(X\), the coin chosen for the next toss.
- Action: \(A\), the choice of head or tail the agent bets on.
- Reinforcements: \(\{1, -1\}\).

A policy \(\pi: X \to A\). Example:
  \(\pi\): Coin 1 → head, Coin 2 → tail, Coin 3 → head

Gambling example

RL model:
- Input: \(X\), the coin chosen for the next toss.
- Action: \(A\), the choice of head or tail the agent bets on.
- Reinforcements: \(\{1, -1\}\).

A policy:
  \(\pi\): Coin 1 → head, Coin 2 → tail, Coin 3 → head

Learning goal: find the optimal policy \(\pi^*: X \to A\) maximizing future expected profits, where the discount factor \(\gamma\) models the present value of money.
  \(\pi^*\): Coin 1 → ?, Coin 2 → ?, Coin 3 → ?

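Expected rewards

Expected rewards for \(\pi: X \to A\).

[Figure: reward trajectories over time for run 1, run 2, run 3, ...]

The expected reward is an expectation over many possible reward trajectories for \(\pi: X \to A\).

Expected discounted rewards

Expected discounted rewards for \(\pi: X \to A\). Discounting with \(\gamma\) reflects the future value of money.

[Figure: the rewards of one run over time, shown without discounting and with discounting.]

The expected discounted reward is an expectation over many possible discounted reward trajectories for \(\pi: X \to A\).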
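The idea of an expectation over many discounted reward trajectories can be sketched numerically. This is a minimal illustration, not taken from the slides: the function names, the 20-step horizon, and the fair ±1 reward process are all assumptions made for the example.

```python
import random

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a single reward trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_expected_return(sample_run, gamma, n_runs, seed=0):
    """Average the discounted return over n_runs sampled trajectories."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        total += discounted_return(sample_run(rng), gamma)
    return total / n_runs

def fair_bet_run(rng, horizon=20):
    # Hypothetical run: reward +1 or -1 with equal probability at each step.
    return [rng.choice([1, -1]) for _ in range(horizon)]

print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(estimate_expected_return(fair_bet_run, 0.9, 10_000))  # close to 0
```

With `gamma = 1.0` the same function gives the undiscounted sum, which corresponds to the "no discounting" panel.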

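RL learning: objective functions

Objective: find a mapping \(\pi^*: X \to A\) that maximizes some combination of future reinforcements (rewards) received over time.

Valuation models quantify how good the mapping is:
- Finite horizon model: \(E\left(\sum_{t=0}^{T} r_t\right)\), with time horizon \(T > 0\).
- Infinite horizon discounted model: \(E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)\), with discount factor \(0 \le \gamma < 1\).
- Average reward: \(\lim_{T \to \infty} \frac{1}{T} E\left(\sum_{t=0}^{T} r_t\right)\).

Agent navigation example

Agent navigation in the maze:
- 4 moves in compass directions.
- Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability.
- Objective: learn how to reach the goal state in the shortest expected time (number of moves).

[Figure: maze with the goal state marked G.]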

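Agent navigation example

The RL model:
- Input: \(X\), the position of the agent.
- Output: \(A\), the next move.
- Reinforcements: a negative reinforcement for each move, a positive reinforcement for reaching the goal.

A policy \(\pi: X \to A\) assigns a move to every position, for example right for one position, right for another, and left for a third.

Goal: find the policy maximizing future expected rewards.

[Figure: maze with the goal state marked G and the moves chosen by the policy.]

Exploration vs. exploitation in RL

The learner actively interacts with the environment:
- At the beginning the learner does not know anything about the environment.
- It gradually gains experience and learns how to react to the environment.

Dilemma (exploration-exploitation): after some number of steps, should I select the best current choice (exploitation) or try to learn more about the environment (exploration)?
- Exploitation may involve the selection of a sub-optimal action and prevent the learning of the optimal choice.
- Exploration may spend too much time on trying bad, currently suboptimal actions.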

Effects of actions on the environment

Effect of actions on the environment (the next input \(x\) to be seen):
- No effect. The distribution over possible \(x\) is fixed and independent of past actions. The rewards received depend only on the state and the action chosen, and they are seen after the action.
- Actions may affect the environment and the next inputs \(x\). The distribution of \(x\) can change due to past actions, and the rewards related to an action can be seen with some delay.

This leads to two forms of reinforcement learning:
- Learning with immediate rewards: the 3-coin gambling example.
- Learning with delayed rewards: the agent navigation example. Move choices affect the state of the environment (the position changes), and the big reward at the goal state is delayed.

RL with immediate rewards

Game: 3 biased coins.
- The coin to be tossed is selected randomly from the three coin options.
- The agent always sees which coin is going to be played next.
- The agent makes a bet on either a head or a tail with a wager of $1.
- If after the coin toss the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.

RL model:
- Input: \(X\), the coin chosen for the next toss.
- Action: \(A\), head or tail, whichever the agent bets on.
- Reinforcements: \(\{1, -1\}\), the $1 either won or lost.

Learning goal: find the optimal policy \(\pi^*: X \to A\) maximizing the future expected profits over time, \(E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)\), where \(\gamma\) is the discount factor.

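RL with immediate rewards

Expected reward: \(E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)\).

Immediate reward case:
- The reward \(r_t\) depends only on the input \(x_t\) and the action choice \(a_t\).
- The action does not affect the environment, and hence does not affect future inputs (states) or future rewards:

\[ E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \dots \]

Here \(r_0, r_1, r_2, \dots\) are the rewards for every step of the game.

Expected one-step reward for input \(x\) (the coin to play next) and the choice \(a\): \(R(x, a)\).

RL with immediate rewards

Immediate reward case: the reward for input \(x\) and the action choice \(a\) may vary.

Expected reward for the input \(x\) and choice \(a\): \(R(x, a)\). For the coin bet problem it is:

\[ R(x, a_i) = \sum_j r(a_i, \omega_j) \, P(\omega_j \mid x) \]

- \(\omega_j\): an outcome of the coin toss.
- \(r(a_i, \omega_j)\): the reward for an outcome \(\omega_j\) and the bet made on \(a_i\).

Expected one-step reward for a strategy \(\pi: X \to A\):

\[ R(\pi) = \sum_x R(x, \pi(x)) \, P(x) \]

\(R(\pi)\) is the expected reward for each of \(r_0, r_1, r_2, \dots\)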
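The expected one-step reward for the coin bet problem can be worked out directly from these definitions. The sketch below is illustrative only: the head probabilities in `P_HEAD` and the uniform coin-selection probabilities in `P_COIN` are assumed values, since the slides do not state the coin biases.

```python
# Hypothetical head probabilities for the three biased coins (assumed values).
P_HEAD = {"coin1": 0.7, "coin2": 0.4, "coin3": 0.5}
# Assumption: each coin is selected with equal probability.
P_COIN = {"coin1": 1 / 3, "coin2": 1 / 3, "coin3": 1 / 3}

def reward(bet, outcome):
    """+1 if the outcome agrees with the bet, -1 otherwise."""
    return 1 if bet == outcome else -1

def expected_reward(coin, bet):
    """R(x, a): sum over toss outcomes of reward times outcome probability."""
    p = P_HEAD[coin]
    return reward(bet, "head") * p + reward(bet, "tail") * (1 - p)

def policy_reward(policy):
    """R(pi): sum over coins of P(x) * R(x, pi(x))."""
    return sum(P_COIN[c] * expected_reward(c, policy[c]) for c in P_COIN)

example_policy = {"coin1": "head", "coin2": "tail", "coin3": "head"}
print(expected_reward("coin1", "head"))  # about 0.4 under the assumed bias
print(policy_reward(example_policy))     # about 0.2 under the assumed biases
```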

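RL with immediate rewards

Expected reward:

\[ E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \dots \]

Optimizing the expected reward:

\[ \max_\pi E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = \max_\pi \sum_{t=0}^{\infty} \gamma^t E(r_t) = \max_\pi \sum_{t=0}^{\infty} \gamma^t R(\pi) = \left(\sum_{t=0}^{\infty} \gamma^t\right) \max_\pi R(\pi) \]

\[ \max_\pi R(\pi) = \max_\pi \sum_x P(x) \, R(x, \pi(x)) = \sum_x P(x) \left[ \max_{\pi(x)} R(x, \pi(x)) \right] \]

Optimal strategy \(\pi^*: X \to A\):

\[ \pi^*(x) = \arg\max_a R(x, a) \]

RL with immediate rewards

We know that

\[ \pi^*(x) = \arg\max_a R(x, a) \]

Problem: in the RL framework we do not know \(R(x, a)\), the expected reward for performing action \(a\) at input \(x\).

How do we estimate \(R(x, a)\)?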
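The claim that maximizing per input also maximizes the overall one-step reward can be checked on a small table. This is a sketch under assumptions: the values in `R` and the uniform input distribution are made up for illustration, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical table of expected rewards R(x, a) (assumed values).
R = {
    ("coin1", "head"): 0.4, ("coin1", "tail"): -0.4,
    ("coin2", "head"): -0.2, ("coin2", "tail"): 0.2,
    ("coin3", "head"): 0.0, ("coin3", "tail"): 0.0,
}
INPUTS = ["coin1", "coin2", "coin3"]
ACTIONS = ["head", "tail"]
P_X = {x: 1 / 3 for x in INPUTS}  # assumption: inputs are equally likely

def greedy_policy(R, inputs, actions):
    """pi*(x) = argmax_a R(x, a); ties go to the first action listed."""
    return {x: max(actions, key=lambda a: R[(x, a)]) for x in inputs}

def policy_value(R, policy, p_x):
    """R(pi) = sum_x P(x) * R(x, pi(x))."""
    return sum(p_x[x] * R[(x, policy[x])] for x in policy)

def best_value_by_enumeration(R, inputs, actions, p_x):
    """Brute-force check: the best R(pi) over all deterministic policies."""
    return max(
        policy_value(R, dict(zip(inputs, choice)), p_x)
        for choice in product(actions, repeat=len(inputs))
    )

pi_star = greedy_policy(R, INPUTS, ACTIONS)
print(pi_star)
print(policy_value(R, pi_star, P_X), best_value_by_enumeration(R, INPUTS, ACTIONS, P_X))
```

The two printed values agree, which is what the exchange of the maximum and the sum over \(x\) predicts.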

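RL with immediate rewards

Problem: in the RL framework we do not know \(R(x, a)\), the expected reward for performing action \(a\) at input \(x\).

Solution:
- For each input \(x\) try different actions \(a\).
- Estimate \(R(x, a)\) using the average of observed rewards:

\[ \tilde{R}(x, a) = \frac{1}{N_{x,a}} \sum_{i=1}^{N_{x,a}} r_i^{x,a} \]

- Action choice: \(\pi(x) = \arg\max_a \tilde{R}(x, a)\).

Accuracy of the estimate: statistics (Hoeffding's bound, two-sided form):

\[ P\left( \left| \tilde{R}(x, a) - R(x, a) \right| \ge \varepsilon \right) \le 2 \exp\left[ - \frac{2 \varepsilon^2 N_{x,a}}{(r_{\max} - r_{\min})^2} \right] \le \delta \]

Number of samples:

\[ N_{x,a} \ge \frac{(r_{\max} - r_{\min})^2}{2 \varepsilon^2} \ln \frac{2}{\delta} \]

RL with immediate rewards

On-line stochastic approximation: an alternative way to estimate \(R(x, a)\).

Idea:
- Choose action \(a\) for input \(x\) and observe the reward \(r^{x,a}\).
- Update the estimate in every step \(i\):

\[ \tilde{R}(x, a)^{(i)} \leftarrow (1 - \alpha(i)) \, \tilde{R}(x, a)^{(i-1)} + \alpha(i) \, r_i^{x,a} \]

where \(\alpha\) is the learning rate.

Convergence property: the approximation converges in the limit for an appropriate learning rate schedule. Assume \(\alpha(n(x, a))\) is the learning rate for the \(n\)th trial of the pair \((x, a)\). Then convergence is assured if:

1. \(\sum_{i=1}^{\infty} \alpha(i) = \infty\)
2. \(\sum_{i=1}^{\infty} \alpha(i)^2 < \infty\)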
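To get a feel for the sample sizes involved, the bound can be turned into a small calculator. This sketch assumes the two-sided form of Hoeffding's inequality, \(2\exp(-2\varepsilon^2 N / (r_{\max} - r_{\min})^2) \le \delta\); the function names and the head probability used in the simulation are hypothetical.

```python
import math
import random

def sample_average(rewards):
    """Estimate of R(x, a): the mean of the rewards observed for one (x, a) pair."""
    return sum(rewards) / len(rewards)

def samples_needed(eps, delta, r_min, r_max):
    """Smallest N with 2 * exp(-2 * eps**2 * N / (r_max - r_min)**2) <= delta."""
    width = r_max - r_min
    return math.ceil(width ** 2 * math.log(2 / delta) / (2 * eps ** 2))

# Rewards lie in [-1, 1]; ask for accuracy 0.1 with failure probability 0.05.
n = samples_needed(eps=0.1, delta=0.05, r_min=-1, r_max=1)

# Simulate always betting head on a coin with an assumed head probability.
P_HEAD = 0.7  # hypothetical bias, so the true expected reward is 0.4
rng = random.Random(0)
rewards = [1 if rng.random() < P_HEAD else -1 for _ in range(n)]
estimate = sample_average(rewards)
print(n, estimate)
```

Note that the required number of samples grows with \(1/\varepsilon^2\) but only logarithmically with \(1/\delta\).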

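RL with immediate rewards

At any step \(i\) during the experiment we have estimates of expected rewards for each (coin, action) pair:

  \(\tilde{R}(\text{coin1}, \text{head})^{(i)}\)   \(\tilde{R}(\text{coin1}, \text{tail})^{(i)}\)
  \(\tilde{R}(\text{coin2}, \text{head})^{(i)}\)   \(\tilde{R}(\text{coin2}, \text{tail})^{(i)}\)
  \(\tilde{R}(\text{coin3}, \text{head})^{(i)}\)   \(\tilde{R}(\text{coin3}, \text{tail})^{(i)}\)

Assume the next coin to play in step \(i+1\) is coin \(c\) and we pick head as our bet. Then we update \(\tilde{R}(\text{coin } c, \text{head})^{(i+1)}\) using the observed reward and one of the update strategies above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g.

\[ \tilde{R}(\text{coin } c, \text{tail})^{(i+1)} = \tilde{R}(\text{coin } c, \text{tail})^{(i)} \]

Exploration vs. exploitation

In the RL framework:
- The learner actively interacts with the environment and chooses the action to play for the current input \(x\).
- At any point in time it also has an estimate \(\tilde{R}(x, a)\) for any input-action pair.

Dilemma for choosing the action to play for \(x\):
- Should the learner choose the current best choice of action (exploitation), \(\hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a)\),
- or choose some other action \(a\) which may help to improve its estimate \(\tilde{R}(x, a)\) (exploration)?

This dilemma is called the exploration/exploitation dilemma. Different exploration/exploitation strategies exist.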
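The per-pair update can be sketched as a table keyed by (coin, bet), where only the entry that was just played changes. Everything concrete here is an assumption for illustration: the head probabilities, the uniformly random coin and bet selection (pure exploration), and the schedule \(\alpha = 1/n\), which satisfies both convergence conditions and makes the estimate equal to the running sample average.

```python
import random

# Hypothetical head probabilities for the three coins (assumed values).
P_HEAD = {"coin1": 0.7, "coin2": 0.4, "coin3": 0.5}
BETS = ["head", "tail"]

def online_update(estimate, reward, alpha):
    """One stochastic-approximation step: (1 - alpha) * old + alpha * reward."""
    return (1 - alpha) * estimate + alpha * reward

def run_online(n_steps, seed=0):
    rng = random.Random(seed)
    estimates = {(c, b): 0.0 for c in P_HEAD for b in BETS}
    trials = {pair: 0 for pair in estimates}
    for _ in range(n_steps):
        coin = rng.choice(list(P_HEAD))  # coin selected at random
        bet = rng.choice(BETS)           # pure exploration: random bet
        outcome = "head" if rng.random() < P_HEAD[coin] else "tail"
        r = 1 if bet == outcome else -1
        pair = (coin, bet)
        trials[pair] += 1
        alpha = 1 / trials[pair]         # learning rate for the n-th trial of this pair
        # Only the played pair is updated; all other estimates stay unchanged.
        estimates[pair] = online_update(estimates[pair], r, alpha)
    return estimates

estimates = run_online(30_000)
for pair, value in sorted(estimates.items()):
    print(pair, round(value, 2))
```

Under the assumed biases the estimates should approach 0.4 and -0.4 for coin 1, -0.2 and 0.2 for coin 2, and 0 for both bets on coin 3.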

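Exploration vs. exploitation

Uniform exploration:
- Exploration parameter \(0 \le \varepsilon \le 1\).
- Choose the current best choice with probability \(1 - \varepsilon\):

\[ \hat{\pi}(x) = \arg\max_{a \in A} \tilde{R}(x, a) \]

- All other choices are selected with a uniform probability \(\frac{\varepsilon}{|A| - 1}\).

Boltzmann exploration:
- The action is chosen randomly but proportionally to its current expected reward estimate:

\[ p(a \mid x) = \frac{\exp\left[ \tilde{R}(x, a) / T \right]}{\sum_{a' \in A} \exp\left[ \tilde{R}(x, a') / T \right]} \]

- \(T\) is the temperature parameter. What does it do?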
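Both selection rules are short enough to sketch, and doing so gives one way to explore the question about the temperature. This is an illustrative sketch rather than the lecture's own code: the function names and the estimate values are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice that leaves the probabilities unchanged.

```python
import math
import random

def uniform_exploration(estimates, eps, rng):
    """Pick the best action with probability 1 - eps, otherwise a uniformly
    chosen other action. `estimates` maps action -> current estimate of R(x, a)."""
    best = max(estimates, key=estimates.get)
    others = [a for a in estimates if a != best]
    if others and rng.random() < eps:
        return rng.choice(others)
    return best

def boltzmann_probs(estimates, temperature):
    """p(a | x) proportional to exp(R(x, a) / T)."""
    top = max(estimates.values())
    weights = {a: math.exp((v - top) / temperature) for a, v in estimates.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def boltzmann_exploration(estimates, temperature, rng):
    probs = boltzmann_probs(estimates, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]

rng = random.Random(0)
est = {"head": 0.4, "tail": -0.4}  # hypothetical estimates for one coin
print(uniform_exploration(est, eps=0.1, rng=rng))
print(boltzmann_exploration(est, temperature=1.0, rng=rng))
for T in (0.01, 1.0, 1000.0):
    print(T, boltzmann_probs(est, T))
```

In this sketch a very small temperature puts almost all probability on the current best action (close to pure exploitation), while a very large temperature makes the choice nearly uniform (close to pure exploration).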