Reinforcement learning II
CS 1675 Introduction to Machine Learning
Lecture 26: Reinforcement learning II
Milos Hauskrecht, 5329 Sennott Square

Reinforcement learning basics: the learner interacts with the environment. It receives an input x with information about the environment (e.g., from sensors), makes actions that (may) affect the environment, and receives a reinforcement signal r from a critic, which provides feedback on how well it performed.
Reinforcement learning

Objective: learn how to act in the environment in order to maximize the reinforcement signal. The selection of actions should depend on the input: a policy π: X → A maps inputs to actions. Goal: find the optimal policy π*: X → A that gives the best expected reinforcements. Example: learn how to play games (AlphaGo).

Gambling example

Game: 3 biased coins. The coin to be tossed is selected randomly from the three coin options, and the agent always sees which coin is going to be played next. The agent makes a bet on either head or tail with a wager of $1. If, after the coin toss, the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.

RL model:
- Input: X — the coin chosen for the next toss
- Action: A — the choice of head or tail the agent bets on
- Reinforcements: {1, -1}

A policy is a mapping π: X → A. Example policy:
π: Coin1 → head, Coin2 → tail, Coin3 → head
Gambling example

RL model:
- Input: X — the coin chosen for the next toss
- Action: A — the choice of head or tail the agent bets on
- Reinforcements: {1, -1}
- A policy π: Coin1 → head, Coin2 → tail, Coin3 → head

State, action, reward trajectories under this policy:
- Step 0: state Coin2, action Tail, reward
- Step 1: state Coin1, action Head, reward
- Step 2: state Coin2, action Tail, reward
- ... Step k: state Coin1, action Head, reward

RL learning: objective functions

Objective: find a policy π: X → A that maximizes some combination of future reinforcements (rewards) received over time. Valuation models (quantify how good the mapping is):
- Finite horizon model: E(Σ_{t=0}^T r_t), with time horizon T > 0
- Infinite horizon discounted model: E(Σ_{t=0}^∞ γ^t r_t), with discount factor 0 ≤ γ < 1
- Average reward model: lim_{T→∞} (1/T) E(Σ_{t=0}^T r_t)
RL with immediate rewards

Immediate reward case: the reward depends only on the input x and the action choice a. The action does not affect the environment, and hence does not affect future inputs (states) or future rewards. Let R(x, a) denote the expected one-step reward for input x (the coin to play next) and the choice a.

The expected reward decomposes as
E(Σ_{t=0}^∞ γ^t r_t) = E(r_0) + γ E(r_1) + γ² E(r_2) + ...

Optimal strategy: π*: X → A with
π*(x) = argmax_a R(x, a)
where R(x, a) is the expected one-step reward for input x and choice a.
RL with immediate rewards

The optimal choice π*(x) = argmax_a R(x, a) assumes we know the expected reward R(x, a).

Caveats:
- We do not know the expected reward; we need to estimate it from interaction.
- We cannot determine the optimal policy if the estimate of the expected reward is not good.
- We therefore also need to try actions that look suboptimal with respect to the current estimates R~(x, a).

Estimating R~(x, a):
- Solution 1: for each input x, try different actions a; estimate the expected reward with the average of the observed rewards:
  R~(x, a) = (1/N_{x,a}) Σ_{i=1}^{N_{x,a}} r_i
- Solution 2: online approximation; update the estimate after performing action a in x and observing the reward r_i:
  R~^{(i)}(x, a) = (1 - α(i)) R~^{(i-1)}(x, a) + α(i) r_i
  where α(i) is a learning rate.
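The two estimation strategies can be sketched in Python; this is a minimal illustration (the coin and bet labels are placeholders invented for the example). With the learning rate α(i) = 1/N_{x,a}, the online update of Solution 2 reproduces the running average of Solution 1 exactly:

```python
estimates = {}  # (input, action) -> current estimate of the expected reward
counts = {}     # (input, action) -> number of times (x, a) has been tried

def update(x, a, r):
    """Online update after performing action a for input x and observing r:
    R~ <- (1 - alpha) * R~ + alpha * r, with alpha = 1 / N(x, a)."""
    n = counts.get((x, a), 0) + 1
    counts[(x, a)] = n
    alpha = 1.0 / n                       # learning rate for the n-th trial
    old = estimates.get((x, a), 0.0)
    estimates[(x, a)] = (1 - alpha) * old + alpha * r

# Example: observed rewards +1, -1, +1 for betting "head" on coin 1
for r in [1.0, -1.0, 1.0]:
    update("coin1", "head", r)
print(estimates[("coin1", "head")])       # average of the observed rewards: 1/3
```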
RL with immediate rewards

At any step i during the experiment we have estimates of the expected rewards for each (coin, action) pair:
R~(coin1, head), R~(coin1, tail), R~(coin2, head), R~(coin2, tail), R~(coin3, head), R~(coin3, tail)

Assume the next coin to play in step (i+1) is coin 2 and we pick head as our bet. Then we update R~^{(i+1)}(coin2, head) using the observed reward and one of the update strategies above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g.
R~^{(i+1)}(coin2, tail) = R~^{(i)}(coin2, tail)

Exploration vs. exploitation

Uniform exploration:
- Uses an exploration parameter 0 ≤ ε ≤ 1.
- Choose the current best action π̂(x) = argmax_a R~(x, a) with probability 1 - ε.
- All other actions are selected with uniform probability ε / (|A| - 1).
- Advantages: simple, easy to implement.
- Disadvantages: exploration is more appropriate at the beginning, when we do not have good estimates of R~; exploitation is more appropriate later, when the estimates are good.
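The uniform exploration rule can be sketched as follows, assuming the reward estimates are kept in a dictionary keyed by (input, action); the estimate values shown are invented for the example:

```python
import random

def epsilon_greedy(estimates, actions, x, epsilon):
    """With probability 1 - epsilon choose the current best action;
    otherwise choose uniformly among the remaining actions, so each
    non-best action has probability epsilon / (|A| - 1)."""
    best = max(actions, key=lambda a: estimates.get((x, a), 0.0))
    if random.random() < epsilon and len(actions) > 1:
        return random.choice([a for a in actions if a != best])
    return best

est = {("coin2", "head"): 0.4, ("coin2", "tail"): -0.1}
print(epsilon_greedy(est, ["head", "tail"], "coin2", 0.0))  # always "head"
```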
Exploration vs. exploitation

Boltzmann exploration:
- The action is chosen randomly, but proportionally to (the exponential of) its current expected reward estimate.
- Can be tuned with a temperature parameter T to promote exploration or exploitation.
- Probability of choosing action a:
  p(a | x) = exp(R~(x, a) / T) / Σ_{a' ∈ A} exp(R~(x, a') / T)
- Effect of T: for high values of T, p(a | x) is close to uniform over all actions; for low values of T, p(a | x) for the action with the highest value of R~ approaches 1.

Agent navigation example

Agent navigation in a maze with 4 moves in compass directions. The effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability. Objective: learn how to reach the goal state G in the shortest expected time.
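Boltzmann exploration follows directly from the softmax formula above; a minimal sketch with made-up reward estimates:

```python
import math
import random

def boltzmann_probs(estimates, actions, x, T):
    """p(a|x) = exp(R~(x,a)/T) / sum_{a'} exp(R~(x,a')/T)."""
    weights = [math.exp(estimates.get((x, a), 0.0) / T) for a in actions]
    total = sum(weights)
    return {a: w / total for a, w in zip(actions, weights)}

def boltzmann_select(estimates, actions, x, T):
    """Sample an action according to the Boltzmann probabilities."""
    probs = boltzmann_probs(estimates, actions, x, T)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]

est = {("coin1", "head"): 1.0, ("coin1", "tail"): -1.0}
print(boltzmann_probs(est, ["head", "tail"], "coin1", 100.0))  # nearly uniform
print(boltzmann_probs(est, ["head", "tail"], "coin1", 0.1))    # nearly greedy
```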
Agent navigation example

The RL model:
- Input: X — the position of the agent
- Output: A — the next move
- Reinforcements: R — -1 for each move, +100 for reaching the goal
- A policy π: X → A

Goal: find the policy maximizing future expected rewards E(Σ_{t=0}^∞ γ^t r_t), 0 ≤ γ < 1. Example policy:
π: Position 1 → right, Position 2 → right, ..., Position 25 → left

State, action, reward trajectories under this policy:
- Step 0: state Pos1, action Right, reward
- Step 1: state Pos2, action Right, reward
- Step 2: state Pos3, action Up, reward
- ... Step k: state Pos15, action Up, reward
Learning with delayed rewards

Actions, in addition to yielding immediate rewards, affect the next state of the environment and thus indirectly also future rewards. We need a model to represent environment changes and the effect of actions on states and on the rewards associated with them: the Markov decision process (MDP), frequently used in AI, operations research, and control theory.

Markov decision process

Formal definition: a 4-tuple (S, A, T, R):
- A set of states S (e.g., the locations of a robot)
- A set of actions A (e.g., move actions)
- A transition model T: S × A × S → [0, 1] (where can I get to with different moves?)
- A reward model R: S × A × S → ℝ (the reward/cost for a transition)
MDP problem

We want to find the best policy π*: S → A. A value function V^π for a policy π quantifies the goodness of the policy through, e.g., the infinite horizon discounted model E(Σ_{t=0}^∞ γ^t r_t). It:
1. combines future rewards over a trajectory;
2. combines rewards for multiple trajectories (through expectation-based measures).

Value of a policy for an MDP

Assume a fixed policy π: S → A. How do we compute the value of the policy under the infinite horizon discounted model? Via a fixed point equation:
V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} P(s' | s, π(s)) V^π(s')
The first term is the expected one-step reward for the first action; the second term is the expected discounted reward for following the policy for the rest of the steps. In matrix form, v = r + γUv, so
v = (I - γU)^{-1} r
For a finite state space we get a set of linear equations.
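Instead of forming (I - γU)^{-1} explicitly, the fixed point equation can also be iterated until the values settle; a minimal Python sketch (the two-state chain and its numbers are hypothetical):

```python
def policy_value(states, P, R, policy, gamma, iters=1000):
    """Iterate V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s').
    P[(s, a)] maps next states to probabilities; R[(s, a)] is the
    expected one-step reward for taking a in s."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: R[(s, policy[s])]
                + gamma * sum(p * V[s2]
                              for s2, p in P[(s, policy[s])].items())
             for s in states}
    return V

# Two-state chain: the policy moves from s0 to s1 at cost -1;
# s1 is absorbing with reward 0, so V(s0) = -1 and V(s1) = 0.
P = {("s0", "go"): {"s1": 1.0}, ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "go"): -1.0, ("s1", "stay"): 0.0}
V = policy_value(["s0", "s1"], P, R, {"s0": "go", "s1": "stay"}, 0.9)
print(V)  # {'s0': -1.0, 's1': 0.0}
```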
Optimal policy

The value of the optimal policy:
V*(s) = max_a [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]
(the expected one-step reward for the first action, plus the expected discounted reward for following the optimal policy for the rest of the steps). The optimal policy π*: S → A:
π*(s) = argmax_a [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]

Computing the optimal policy

Dynamic programming, value iteration: computes the optimal value function first, then the policy. The iterative approximation converges to the optimal value function.

Value iteration:
  initialize V        ;; V is a vector of values for all states
  repeat
    set V' ← V
    set V(s) ← max_a [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V'(s') ] for all s
  until V' ≈ V
  output π(s) = argmax_a [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s') ]
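The value iteration pseudocode can be sketched in Python; the small two-state MDP used to exercise it is invented for illustration:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ],
    repeated until the values stop changing; then extract the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        Vp = V
        V = {s: max(R[(s, a)] + gamma * sum(p * Vp[s2]
                                            for s2, p in P[(s, a)].items())
                    for a in actions)
             for s in states}
        if max(abs(V[s] - Vp[s]) for s in states) <= tol:
            break
    policy = {s: max(actions,
                     key=lambda a: R[(s, a)] + gamma * sum(
                         p * V[s2] for s2, p in P[(s, a)].items()))
              for s in states}
    return V, policy

# Two states: "stay" at s0 pays 0.5 per step, "stay" at s1 pays 1.0;
# "move" switches state for free. It is optimal to move from s0 to s1.
states, actions = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
R = {("s0", "stay"): 0.5, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}
V, policy = value_iteration(states, actions, P, R, 0.9)
print(policy)  # {'s0': 'move', 's1': 'stay'}
```

With γ = 0.9 the optimal values are V*(s1) = 1/(1 - 0.9) = 10 and V*(s0) = 0.9 · 10 = 9.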
Reinforcement learning of optimal policies

In the RL framework we do not know the MDP model! Goal: learn the optimal policy π*: S → A. Two basic approaches:
- Model-based learning: learn the MDP model (probabilities, rewards) first; solve the MDP afterwards.
- Model-free learning: learn how to act directly; no need to learn the parameters of the MDP.
There are a number of variants of the two in the literature.

Model-based learning

We need to learn the transition probabilities and rewards.
- Learning of probabilities: maximum-likelihood parameter estimates using counts:
  P~(s' | s, a) = N_{s,a,s'} / N_{s,a}
- Learning rewards: similar to learning with immediate rewards:
  R~(s, a) = (1/N_{s,a}) Σ_{i=1}^{N_{s,a}} r_i
  or the online solution.
Problem: changes in the probability and reward estimates would require us to solve an MDP from scratch (after every action and reward seen)!
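The count-based maximum-likelihood estimates can be sketched as follows (the positions and moves are placeholder names):

```python
from collections import defaultdict

N_sa = defaultdict(int)     # N(s, a): times action a was taken in state s
N_sas = defaultdict(int)    # N(s, a, s'): times that led to next state s'
R_sum = defaultdict(float)  # running sum of rewards observed for (s, a)

def record(s, a, r, s2):
    """Record one observed transition (s, a, r, s')."""
    N_sa[(s, a)] += 1
    N_sas[(s, a, s2)] += 1
    R_sum[(s, a)] += r

def P_hat(s2, s, a):
    """P~(s'|s, a) = N(s, a, s') / N(s, a)."""
    return N_sas[(s, a, s2)] / N_sa[(s, a)]

def R_hat(s, a):
    """R~(s, a): average of the rewards observed for (s, a)."""
    return R_sum[(s, a)] / N_sa[(s, a)]

record("pos1", "up", -1.0, "pos2")
record("pos1", "up", -1.0, "pos1")   # stochastic effect: the move failed
record("pos1", "up", -1.0, "pos2")
print(P_hat("pos2", "pos1", "up"))   # 2/3
print(R_hat("pos1", "up"))           # -1.0
```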
Model-free learning

Motivation: the value function update (value iteration):
V*(s) = max_a [ R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]
Let
Q(s, a) = R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s')
Then V*(s) = max_a Q(s, a). Note that the update can be defined purely in terms of Q-functions:
Q(s, a) = R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) max_{a'} Q(s', a')

Q-learning

Q-learning uses the Q-value update idea, but relies on a stochastic (online, sample-by-sample) update: the expectation R(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) max_{a'} Q(s', a') is replaced with
Q̂(s, a) ← (1 - α) Q̂(s, a) + α [ r(s, a) + γ max_{a'} Q̂(s', a') ]
where r(s, a) is the reward received from the environment after performing action a in state s, s' is the new state reached after action a, and α is a learning rate, a function of N_{s,a}, the number of times a has been executed at s.
Q-function updates in Q-learning

At any step i during the experiment we have estimates of the Q-function for each (state, action) pair:
Q(position1, up), Q(position1, left), Q(position1, right), Q(position1, down), Q(position2, up), ...

Assume the current state is position 1 and we pick the "up" action to be performed next. After we observe the reward, we update Q~(position1, up) and keep the Q-function estimates for the remaining (state, action) pairs unchanged.

Q-learning

The online update rule is applied repeatedly during direct interaction with the environment:

  initialize Q(s, a) = 0 for all (s, a) pairs
  observe the current state s
  repeat
    select action a           ;; use some exploration/exploitation schedule
    receive reward r
    observe the next state s'
    update Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
    set s to s'
  end repeat
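The loop above can be sketched in Python. Everything about the toy environment (a 1-D corridor with the goal at one end, move cost -1, goal reward +100) and the 1/N(s, a) learning-rate schedule are assumptions made for the example:

```python
import random
from collections import defaultdict

def q_learning(env_step, start, actions, episodes, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy schedule and a
    1/N(s, a) learning rate. env_step(s, a) -> (reward, next_state, done)."""
    Q, N = defaultdict(float), defaultdict(int)
    for _ in range(episodes):
        s, done = start, False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                    # explore
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])   # exploit
            r, s2, done = env_step(s, a)
            N[(s, a)] += 1
            alpha = 1.0 / N[(s, a)]
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q

# Corridor 0..3: "right" moves toward the goal at 3, every move costs -1,
# and reaching the goal additionally pays +100 (so the last step is worth 99).
def env_step(s, a):
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return (99.0, s2, True) if s2 == 3 else (-1.0, s2, False)

random.seed(0)
Q = q_learning(env_step, 0, ["left", "right"], episodes=500)
print(max(["left", "right"], key=lambda a: Q[(0, a)]))  # "right"
```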
Q-learning convergence

Q-learning is guaranteed to converge to the optimal Q-values under the following conditions:
- Every state is visited, and every action in that state is tried, an infinite number of times. This is assured via the exploration/exploitation schedule.
- The sequence of learning rates for each Q(s, a) satisfies:
  1. Σ_{i=1}^∞ α(i) = ∞
  2. Σ_{i=1}^∞ α(i)² < ∞
  where α(i) is the learning rate for the i-th trial of (s, a).

RL with delayed rewards

The optimal choice is
π*(s) = argmax_a Q(s, a)
much like what we had for the immediate rewards, π*(x) = argmax_a R(x, a). Instead of the exact values of Q(s, a) we use the learned estimates Q̂(s, a), updated by
Q̂(s, a) ← (1 - α) Q̂(s, a) + α [ r(s, a) + γ max_{a'} Q̂(s', a') ]
Since we have only estimates of Q̂(s, a), we also need to try actions that look suboptimal with respect to the current estimates. Exploration/exploitation strategies: uniform exploration, Boltzmann exploration.
Q-learning speed-ups

The basic Q-learning rule may propagate distant (delayed) rewards very slowly. Example: the goal G is a high-reward state. To make the correct decision we need all Q-values for the current position to be good. Problem: in each run we back-propagate values only one step back, so it takes multiple trials to back-propagate values multiple steps.

Remedy: back up values over a larger number of steps. Rewards from applying the policy:
q_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i}
We can substitute immediate rewards with n-step rewards:
q_t^{(n)} = Σ_{i=0}^{n-1} γ^i r_{t+i} + γ^n max_{a'} Q_{t+n}(s_{t+n}, a')
Postpone the update for n steps and update with the longer trajectory reward:
Q_{t+n+1}(s_t, a_t) ← Q_{t+n}(s_t, a_t) + α [ q_t^{(n)} - Q_{t+n}(s_t, a_t) ]
Problems: larger variance; exploration/exploitation switching; we must wait n steps to update.
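A small helper makes the n-step return concrete (the reward sequences in the examples are arbitrary):

```python
def n_step_return(rewards, gamma, n, q_tail):
    """q_t^(n) = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * q_tail,
    where q_tail stands for max_a' Q_{t+n}(s_{t+n}, a')."""
    assert len(rewards) == n
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** n) * q_tail

# With n = 1 this is the ordinary one-step Q-learning target r + gamma * max Q:
print(n_step_return([-1.0], 0.9, 1, 10.0))             # -1 + 0.9 * 10 = 8.0
# A three-step backup propagates a distant reward in a single update:
print(n_step_return([-1.0, -1.0, 99.0], 0.9, 3, 0.0))  # -1 - 0.9 + 0.81 * 99
```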
Q-learning speed-ups

One-step vs. n-step backup. Problems with n-step backups: larger variance; exploration/exploitation switching; we must wait n steps to update.

Temporal difference (TD) method: a remedy for the wait-n-steps problem. It performs a partial backup after every simulation step (a similar idea to weather forecast adjustment). Different versions of this idea have been implemented.
RL successes

Reinforcement learning is relatively simple, and online techniques can track non-stationary environments and adapt to their changes. Successful applications:
- DeepMind's AlphaGo (AlphaZero)
- TD-Gammon, which learned to play backgammon at championship level
- Elevator control
- Dynamic channel allocation in mobile telephony
- Robot navigation in the environment
More informationJim Lambers MAT 169 Fall Semester Lecture 4 Notes
Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of
More informationA signalling model of school grades: centralized versus decentralized examinations
A signlling model of school grdes: centrlized versus decentrlized exmintions Mri De Pol nd Vincenzo Scopp Diprtimento di Economi e Sttistic, Università dell Clbri m.depol@unicl.it; v.scopp@unicl.it 1 The
More informationP 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0)
1 Tylor polynomils In Section 3.5, we discussed how to pproximte function f(x) round point in terms of its first derivtive f (x) evluted t, tht is using the liner pproximtion f() + f ()(x ). We clled this
More informationChapters 4 & 5 Integrals & Applications
Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions
More information1 Error Analysis of Simple Rules for Numerical Integration
cs41: introduction to numericl nlysis 11/16/10 Lecture 19: Numericl Integrtion II Instructor: Professor Amos Ron Scries: Mrk Cowlishw, Nthnel Fillmore 1 Error Anlysis of Simple Rules for Numericl Integrtion
More informationExtended nonlocal games from quantum-classical games
Extended nonlocl gmes from quntum-clssicl gmes Theory Seminr incent Russo niversity of Wterloo October 17, 2016 Outline Extended nonlocl gmes nd quntum-clssicl gmes Entngled vlues nd the dimension of entnglement
More informationREINFORCEMENT learning (RL) was originally studied
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 45, NO. 3, MARCH 2015 385 Multiobjective Reinforcement Lerning: A Comprehensive Overview Chunming Liu, Xin Xu, Senior Member, IEEE, nd
More informationLecture 20: Numerical Integration III
cs4: introduction to numericl nlysis /8/0 Lecture 0: Numericl Integrtion III Instructor: Professor Amos Ron Scribes: Mrk Cowlishw, Yunpeng Li, Nthnel Fillmore For the lst few lectures we hve discussed
More informationThe Regulated and Riemann Integrals
Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue
More informationMath 270A: Numerical Linear Algebra
Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner
More informationLecture Note 9: Orthogonal Reduction
MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A
More informationHow to simulate Turing machines by invertible one-dimensional cellular automata
How to simulte Turing mchines by invertible one-dimensionl cellulr utomt Jen-Christophe Dubcq Déprtement de Mthémtiques et d Informtique, École Normle Supérieure de Lyon, 46, llée d Itlie, 69364 Lyon Cedex
More informationA Continuous-time Markov Decision Process Based Method on Pursuit-Evasion Problem
Preprints of the th World Congress The Interntionl Federtion of Automtic Control Cpe Town, South Afric. August -, A Continuous-time Mrkov Decision Process Bsed Method on Pursuit-Evsion Problem Ji Shengde
More informationUninformed Search Lecture 4
Lecture 4 Wht re common serch strtegies tht operte given only serch problem? How do they compre? 1 Agend A quick refresher DFS, BFS, ID-DFS, UCS Unifiction! 2 Serch Problem Formlism Defined vi the following
More informationf(x) dx, If one of these two conditions is not met, we call the integral improper. Our usual definition for the value for the definite integral
Improper Integrls Every time tht we hve evluted definite integrl such s f(x) dx, we hve mde two implicit ssumptions bout the integrl:. The intervl [, b] is finite, nd. f(x) is continuous on [, b]. If one
More informationDiscrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17
EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,
More informationIn-Class Problems 2 and 3: Projectile Motion Solutions. In-Class Problem 2: Throwing a Stone Down a Hill
MASSACHUSETTS INSTITUTE OF TECHNOLOGY Deprtment of Physics Physics 8T Fll Term 4 In-Clss Problems nd 3: Projectile Motion Solutions We would like ech group to pply the problem solving strtegy with the
More informationNumerical integration
2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter
More information