Reinforcement learning crash course
1 Reinforcement learning crash course. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich
2 Overview: a rough overview, mainly following Sutton & Barto 1998; dopamine, prediction errors and more; fitting behaviour with RL models; hierarchical approaches
3 Setup: the agent emits actions a_t, the environment returns states s_t and rewards r_t. The agent's goal: {a_t} = argmax_{a_t} \sum_{t=1}^\infty r_t. After Sutton and Barto 1998
4 State space: gold +1, electric shocks -1
5 A Markov Decision Problem: states s_t \in S; actions a_t \in A; transition probabilities T^a_{ss'} = p(s_{t+1} | s_t, a_t); rewards r_t \sim R(s_{t+1}, a_t, s_t); policy \pi(a|s) = p(a|s)
7 Actions: action left with transition matrix T^{left}, action right with T^{right}. Dynamics are noisy: plants, environments, agent. Absorbing state -> max eigenvalue < 1
8 Markovian dynamics: p(s_{t+1} | a_t, s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ...) = p(s_{t+1} | a_t, s_t). If velocity matters, augment the state: s = [position] becomes s = [position; velocity]
11 Tall orders. Aim: maximise total future reward \sum_{t=1}^\infty r_t, i.e. we have to sum over paths through the future and weigh each by its probability. The best policy achieves the best long-term reward
12 Exhaustive tree search: width w, depth d
13 Decision tree: \sum_{t=1}^\infty r_t
14 Policy for this talk: pose the problem mathematically; policy evaluation; policy iteration; Monte Carlo techniques: experience samples; TD learning. (Loop: evaluate the policy, update the policy)
15 Evaluating a policy. Aim: maximise total future reward \sum_{t=1}^\infty r_t. To know which policy is best, evaluate it first. The policy determines the expected reward from each state: V^\pi(s_1) = E[ \sum_{t=1}^\infty r_t | s_1 = s, a_t \sim \pi ]
16 Discounting. Given a policy, each state has an expected value V^\pi(s_1) = E[ \sum_{t=1}^\infty r_t | s_1 = s, a_t \sim \pi ]. But the infinite sum may diverge. Episodic tasks: \sum_{t=0}^T r_t < \infty. Discounted infinite horizons: \sum_{t=0}^\infty \gamma^t r_t < \infty; discounting is equivalent to finite, exponentially distributed horizons, p(T) \propto e^{-T/\tau}
17 Markov Decision Problems. V^\pi(s_t) = E[ \sum_{t'=1}^\infty r_{t'} | s_t = s, \pi ] = E[ r_1 | s_t = s, \pi ] + E[ \sum_{t=2}^\infty r_t | s_t = s, \pi ] = E[ r_1 | s_t = s, \pi ] + E[ V^\pi(s_{t+1}) | s_t = s, \pi ]. This dynamic consistency is key to many solution approaches. It states that the value of a state s is related to the values of its successor states s'.
18 Markov Decision Problems. V^\pi(s_t) = E[ r_1 | s_t = s, \pi ] + E[ V^\pi(s_{t+1}) | \pi ], with r_1 \sim R(s_2, a_1, s_1). The first term expands as E[ r_1 | s_t = s, \pi ] = E_\pi[ \sum_{s_{t+1}} p(s_{t+1} | s_t, a_t) R(s_{t+1}, a_t, s_t) ] = \sum_{a_t} p(a_t | s_t) \sum_{s_{t+1}} p(s_{t+1} | s_t, a_t) R(s_{t+1}, a_t, s_t) = \sum_{a_t} \pi(a_t | s_t) \sum_{s_{t+1}} T^{a_t}_{s_t s_{t+1}} R(s_{t+1}, a_t, s_t)
19 Bellman equation. V^\pi(s_t) = E[ r_1 | s_t = s, \pi ] + E[ V^\pi(s_{t+1}) | \pi ], where E[ r_1 | s_t, \pi ] = \sum_a \pi(a | s_t) \sum_{s_{t+1}} T^a_{s_t s_{t+1}} R(s_{t+1}, a, s_t) and E[ V^\pi(s_{t+1}) | \pi, s_t ] = \sum_a \pi(a | s_t) \sum_{s_{t+1}} T^a_{s_t s_{t+1}} V^\pi(s_{t+1}). Together: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ]
20 Bellman equation. All future reward from state s = E[ immediate reward + all future reward from the next state s' ]: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ]
21 Q values = state-action values. In V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ], the inner sum defines the state-action value: Q^\pi(s, a) = \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ] = E[ \sum_{t=1}^\infty r_t | s, a, \pi ], and state values are policy-averaged state-action values: V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)
22 Bellman equation. V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ]. To evaluate a policy, we need to solve this equation, i.e. find the self-consistent state values. Options for policy evaluation: exhaustive tree search (outwards, inwards, depth-first); value iteration: iterative updates; linear solution in one step; experience sampling
23 Solving the Bellman equation. Option 1: turn it into an update equation: V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V_k(s') ]. Option 2: linear solution (with absorbing states): V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ], i.e. v = R + T v, so v = (I - T)^{-1} R, at cost O(|S|^3)
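Both solution options can be sketched on a tiny policy-evaluated chain. This is a minimal illustration with hypothetical numbers: P is the policy-averaged transition matrix and r the expected immediate reward per state; neither comes from the slides.

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy: P[s, s'] is the
# policy-averaged transition matrix, r[s] the expected immediate reward.
P = np.array([[0.9, 0.1],
              [0.0, 0.0]])   # state 1 leads to an absorbing exit
r = np.array([0.0, 1.0])

# Option 1: turn the Bellman equation into an update and iterate to a fixed point.
v_iter = np.zeros(2)
for _ in range(1000):
    v_iter = r + P @ v_iter

# Option 2: linear solution v = (I - P)^{-1} r (valid with absorbing states,
# where the spectral radius of P is below 1).
v_lin = np.linalg.solve(np.eye(2) - P, r)

print(v_iter, v_lin)  # the two solutions agree
```

The iterative route costs one matrix-vector product per sweep; the linear solve pays the O(|S|^3) factorisation once.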
24 Policy update. Given the value function for a policy, say via the linear solution: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ], with Q^\pi(s, a) the inner sum. Given the values V^\pi, we can improve the policy by always choosing the best action: \pi'(a|s) = 1 if a = argmax_a Q^\pi(s, a), 0 else. This is guaranteed to improve: for a deterministic policy, Q^\pi(s, \pi'(s)) = max_a Q^\pi(s, a) >= Q^\pi(s, \pi(s)) = V^\pi(s)
25 Policy iteration: alternate policy evaluation, v^\pi = (I - T^\pi)^{-1} R^\pi, with greedy policy improvement, \pi(a|s) = 1 if a = argmax_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V^\pi(s') ], 0 else. Value iteration: V(s) = max_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V(s') ]
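The evaluate-then-greedify loop above can be sketched end to end. This is a hypothetical 2-state, 2-action MDP (numbers invented for illustration, with a discount gamma that the slides' undiscounted notation omits):

```python
import numpy as np

# Hypothetical MDP: T[a, s, s'] transition probabilities, R[a, s] expected reward.
T = np.array([[[1.0, 0.0], [1.0, 0.0]],    # action 0: always go to state 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1: always go to state 1
R = np.array([[0.0, 0.0],                  # action 0 pays nothing
              [1.0, 1.0]])                 # action 1 pays 1
gamma = 0.9

policy = np.zeros(2, dtype=int)            # start with "always action 0"
for _ in range(10):
    # Policy evaluation: v = (I - gamma * T_pi)^{-1} R_pi
    T_pi = T[policy, np.arange(2)]
    R_pi = R[policy, np.arange(2)]
    v = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
    # Greedy policy improvement on the resulting Q values
    Q = R + gamma * np.einsum('ast,t->as', T, v)
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break                              # policy is stable: done
    policy = new_policy

print(policy)  # converges to "always take action 1"
```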
26 Model-free solutions. So far we have assumed knowledge of R and T. R and T are the model of the world, so we assume full knowledge of the dynamics and rewards in the environment. What if we don't know them? We can still learn from state-action-reward samples: we can learn R and T from them and use our estimates to solve as above; alternatively, we can directly estimate V or Q
27 Solving the Bellman equation. Option 3: sampling. V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ] is an expectation over the policy and the transitions, so we can just draw samples from both and average over them (more about this later): a = \sum_k f(x_k) p(x_k) is approximated by \hat{a} = (1/N) \sum_i f(x^{(i)}), with x^{(i)} \sim p(x)
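The sample-average idea is easy to see in isolation. A minimal sketch with an invented discrete distribution p and function f:

```python
import numpy as np

# Estimating a = sum_k f(x_k) p(x_k) from samples x^(i) ~ p(x).
rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])   # hypothetical distribution over 3 outcomes
f = np.array([1.0, 2.0, 4.0])

exact = (f * p).sum()                         # the true expectation
samples = rng.choice(3, size=100_000, p=p)    # draw x^(i) ~ p
estimate = f[samples].mean()                  # (1/N) * sum_i f(x^(i))
print(exact, estimate)                        # estimate approaches exact as N grows
```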
28 Learning from samples. A new problem: exploration versus exploitation
29 Monte Carlo. First-visit MC: randomly start in all states, generate paths, and average the returns for the starting state only: V(s) = (1/N) \sum_i \sum_{t'=1}^T r^i_{t'}, with s_0 = s. More efficient use of samples: every-visit MC; bootstrapping: TD; Dyna; better samples; on-policy versus off-policy; stochastic search, UCT, ...
30 Update equation: towards TD. Bellman equation: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. If V has not yet converged, the equation does not hold, leaving a residual: dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. We then use this to update: V_{i+1}(s) = V_i(s) + dV(s)
31 TD learning. dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. Sample instead: a_t \sim \pi(a|s_t), s_{t+1} \sim T^{a_t}_{s_t s_{t+1}}, r_t = R(s_{t+1}, a_t, s_t); the sampled residual is the TD error \delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1}), and instead of V_{i+1}(s) = V_i(s) + dV(s) we update V_t(s_t) = V_{t-1}(s_t) + \alpha \delta_t
32 TD learning. a_t \sim \pi(a|s_t), s_{t+1} \sim T^{a_t}_{s_t s_{t+1}}, r_t = R(s_{t+1}, a_t, s_t); \delta_t = -V_t(s_t) + r_t + V_t(s_{t+1}); V_{t+1}(s_t) = V_t(s_t) + \alpha \delta_t
33 SARSA. Do TD for state-action values instead: Q(s_t, a_t) <- Q(s_t, a_t) + \alpha [ r_t + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ], using the sample s_t, a_t, r_t, s_{t+1}, a_{t+1}. Convergence guarantees: will estimate Q^\pi(s, a)
34 Q-learning: off-policy. Learn off-policy: draw actions from some behaviour policy, only requiring extensive sampling. Q(s_t, a_t) <- Q(s_t, a_t) + \alpha [ r_t + max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]: update towards the optimum; will estimate Q*(s, a)
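The off-policy property is visible in a sketch: even with a purely random behaviour policy, the max bootstrap drives Q to the optimal values. The environment below is hypothetical (a two-state chain with a rewarded goal), chosen only to make the fixed point easy to check by hand:

```python
import numpy as np

# Q-learning: random behaviour policy, greedy (max) bootstrap target.
# Hypothetical chain, gamma = 0.9: from s0, action 1 moves to s1, action 0 stays;
# from s1, action 1 reaches the goal (reward 1, terminal), action 0 returns to s0.
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))            # row 2 = terminal state, stays at 0
alpha, gamma = 0.2, 0.9

def step(s, a):
    if s == 0:
        return (1, 0.0) if a == 1 else (0, 0.0)
    return (2, 1.0) if a == 1 else (0, 0.0)

for _ in range(3000):
    s = rng.integers(2)         # exploring starts
    a = rng.integers(2)         # off-policy: behaviour is uniformly random
    s2, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

print(Q[:2].round(2))
```

By hand, the optimal values here are Q*(s1, 1) = 1, Q*(s0, 1) = 0.9, and Q*(s0, 0) = Q*(s1, 0) = 0.81, which the sampled estimates approach.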
35 The effect of bootstrapping. Observed episodes: B1, B1, B1, B1, B1, B1, B0, and A0 followed by B0. Markov (every-visit) MC: V(B) = 3/4, V(A) = 0. TD: V(B) = 3/4, V(A) approx 3/4. Averaging over various degrees of bootstrapping: TD(\lambda). After Sutton and Barto 1998
36 Conclusion. Long-term rewards have an internal consistency. This can be exploited for solution. Exploration and exploitation trade off when sampling. Clever use of samples can produce fast learning. The brain most likely does something like this
37 Fitting models to behaviour. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich
38 Example task. A Go/Nogo task with four conditions: Go to win (Go rewarded), Go to avoid (Nogo punished), Nogo to win (Nogo rewarded), Nogo to avoid (Go punished). [Figure: probability correct and probability(Go) over trials for each condition.] Think of it as four separate two-armed bandit tasks. Guitart-Masip, Huys et al. 2012
39 Analysing behaviour. Standard approach: decide which feature of the data you care about; run descriptive statistical tests, e.g. ANOVA. [Figure: probability correct for Go to win, Go to avoid, Nogo to win, Nogo to avoid.] Many strengths. Weaknesses: piecemeal, not holistic/global; descriptive, not generative; no internal variables
40 Models. Holistic: aim to model the process by which the data came about in its entirety. Generative: they can be run on the task to generate data as if a subject had done the task. Inference process: capture the inference process subjects have to make to perform the task, in sufficient detail to replicate the data. Parameters replace test statistics; their meaning is explicit in the model
41 From Q values to actions. The learning process: Q_t(a_t, s_t) = Q_{t-1}(a_t, s_t) + \epsilon (r_t - Q_{t-1}(a_t, s_t)). The link function turns values into choice probabilities: p(a_t | s_t, h_t, \theta) = p(a_t | Q(a_t, s_t), \theta), with p(a_t | s_t) \propto e^{\beta Q(a_t, s_t)}, i.e. p(a_t | s_t) = e^{\beta Q(a_t, s_t)} / \sum_{a'} e^{\beta Q(a', s_t)}, so 0 <= p(a) <= 1. The link function connects the learning process and the observations: choices, RTs, or any other data
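Run as a generative model, the Rescorla-Wagner update plus softmax link produces choice data. A minimal sketch, assuming a single state, two actions, and invented reward probabilities and parameter values:

```python
import numpy as np

# Generative Rescorla-Wagner + softmax: eps = learning rate, beta = inverse
# temperature. One state, two actions; reward probabilities are hypothetical.
rng = np.random.default_rng(0)
eps, beta = 0.2, 3.0
p_reward = np.array([0.2, 0.8])
Q = np.zeros(2)
choices = []

for t in range(200):
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()  # p(a) proportional to e^{beta Q(a)}
    a = rng.choice(2, p=p)                         # sample a choice
    r = float(rng.random() < p_reward[a])          # sample a reward
    Q[a] += eps * (r - Q[a])                       # RW update with prediction error
    choices.append(a)

print(np.mean(choices[-50:]))  # mostly the better action by the end of learning
```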
42 Fitting models I. Maximum likelihood (ML) parameters: \hat{\theta} = argmax_\theta L(\theta), where the likelihood of all choices is L(\theta) = log p({a_t}_{t=1}^T | {s_t}_{t=1}^T, {r_t}_{t=1}^T, \theta) = log p({a_t}_{t=1}^T | {Q(s_t, a_t; \theta)}_{t=1}^T, \theta) = log \prod_{t=1}^T p(a_t | Q(s_t, a_t; \theta), \theta) = \sum_{t=1}^T log p(a_t | Q(s_t, a_t; \theta), \theta)
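The likelihood is computed by replaying the observed choices through the model, trial by trial. A sketch with a tiny invented data set (the choice/reward sequences and parameter values are hypothetical):

```python
import numpy as np

# Sum of log p(a_t | Q_t, theta) for an RW + softmax model, theta = (eps, beta).
def log_likelihood(theta, choices, rewards, n_actions=2):
    eps, beta = theta
    Q = np.zeros(n_actions)
    L = 0.0
    for a, r in zip(choices, rewards):
        logp = beta * Q - np.log(np.exp(beta * Q).sum())  # softmax log-probs
        L += logp[a]                                      # credit the observed choice
        Q[a] += eps * (r - Q[a])                          # RW update after the choice
    return L

choices = [1, 1, 0, 1, 1]          # hypothetical data
rewards = [1.0, 1.0, 0.0, 1.0, 1.0]
print(log_likelihood((0.3, 2.0), choices, rewards))
```

An optimiser then searches theta for the maximum of this function.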
43 Fitting models II. There is no closed form; use your favourite method: gradients, fminunc/fmincon, ... Gradients for the RW model: dL(\theta)/d\theta = \sum_t (d/d\theta) log p(a_t | Q_t(a_t, s_t; \theta), \theta), with (d/d\theta) log p(a_t | \cdot) = \beta [ (d/d\theta) Q_t(a_t, s_t; \theta) - \sum_{a'} p(a' | Q_t(a', s_t; \theta), \theta) (d/d\theta) Q_t(a', s_t; \theta) ] and, for the learning rate, dQ_t(a_t, s_t; \theta)/d\epsilon = (1 - \epsilon) dQ_{t-1}(a_t, s_t; \theta)/d\epsilon + (r_t - Q_{t-1}(a_t, s_t; \theta))
44 Little tricks. Transform your variables: \beta = e^\theta, \theta = log \beta; \epsilon = 1/(1 + e^{-\theta}), \theta = log(\epsilon/(1 - \epsilon)); then optimise d log L(\theta)/d\theta unconstrained. Avoid over/underflow: with y(a) = \beta Q(a) and y_m = max_a y(a), p(a) = e^{y(a)} / \sum_b e^{y(b)} = e^{y(a) - y_m} / \sum_b e^{y(b) - y_m}
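Both tricks are two lines each in practice. A sketch (the particular theta values and the extreme Q values are invented to show the transforms and the overflow case):

```python
import numpy as np

# Trick 2: max-subtracted softmax avoids overflow in exp().
def softmax(y):
    y = y - y.max()                      # subtract y_m = max_a y(a)
    return np.exp(y) / np.exp(y).sum()

# Trick 1: fit unconstrained theta; the transforms enforce the constraints.
theta_beta, theta_eps = np.log(2.0), 0.0
beta = np.exp(theta_beta)                # positive by construction (here 2.0)
eps = 1.0 / (1.0 + np.exp(-theta_eps))   # inside (0, 1) by construction (here 0.5)

p = softmax(beta * np.array([1000.0, 1001.0]))  # naive exp(2000) would overflow
print(beta, eps, p)
```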
45 ML characteristics. ML is asymptotically consistent, but the variance is high. [Figure: ML estimates across runs for a 10-armed bandit, inferring \beta and \epsilon; 1 stimulus, 10 actions, learning rate = .05, \beta = 2; likelihoods for short versus long runs.]
46 Priors. [Figure: not-so-smooth versus smooth prior probability distributions over a parameter; memory and switch data from Dombrovski et al. 2010.]
47 Maximum a posteriori estimate. P(\theta) = p(\theta | a_{1...T}) = p(a_{1...T} | \theta) p(\theta) / \int d\theta' p(a_{1...T} | \theta') p(\theta'); log P(\theta) = \sum_{t=1}^T log p(a_t | \theta) + log p(\theta) + const.; d log P(\theta)/d\theta = d log L(\theta)/d\theta + d log p(\theta)/d\theta. If the likelihood is strong, the prior has little effect; it mainly influences poorly constrained parameters. If a parameter is strongly constrained to be outside the typical range of the prior, then it will win over the prior
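The weak-data versus strong-data behaviour of the MAP estimate can be seen in a toy model. This sketch uses a hypothetical single-parameter Bernoulli choice model with a Gaussian prior on the transformed parameter; the counts and prior settings are invented:

```python
import numpy as np

# log posterior = log likelihood + log prior (up to a constant).
def log_posterior(theta, n_choices_a, n_trials, mu=0.0, sigma=1.0):
    p_a = 1.0 / (1.0 + np.exp(-theta))              # sigmoid-transformed parameter
    log_lik = n_choices_a * np.log(p_a) + (n_trials - n_choices_a) * np.log(1 - p_a)
    log_prior = -0.5 * ((theta - mu) / sigma) ** 2  # Gaussian prior on theta
    return log_lik + log_prior

thetas = np.linspace(-5, 5, 2001)
# 80% "a" choices: weak data (10 trials) versus strong data (1000 trials)
weak = thetas[np.argmax([log_posterior(t, 8, 10) for t in thetas])]
strong = thetas[np.argmax([log_posterior(t, 800, 1000) for t in thetas])]
print(weak, strong)  # the strong-data MAP sits near the ML value logit(0.8)
```

With few trials the prior pulls the estimate toward its mean of 0; with many trials the likelihood dominates.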
48 Maximum a posteriori estimate. [Figure: MAP estimates across runs; 200 trials, 1 stimulus, 10 actions, learning rate = .05, \beta = 2; prior settings mbeta = 0, meps = -3, n = 1.]
49 But: what prior parameters should I use?
50 Hierarchical estimation: random effects. Fixed effect: conflates within- and between-subject variability. Average behaviour: disregards between-subject variability; need to adapt the model. Summary statistic: treat parameters as random variables, one per subject; overestimates the group variance because ML estimates are noisy. Random effects: prior mean = group mean; p(A_i | \mu, \sigma) = \int d\theta_i p(A_i | \theta_i) p(\theta_i | \mu, \sigma)
51 Random effects: see subjects as drawn from a group. Fixed effect with respect to the model: all subjects share the same, parametrically nested model, e.g. Q(a, s) = \omega_1 Q_1(a, s) + \omega_2 Q_2(a, s); this assumes a within-subject mixture rather than a group mixture of perfect types. Random effects in models: each subject's model and parameters are drawn from group-level distributions
52 Estimating the hyperparameters. Effectively we now want to do gradient ascent on p(A | \mu, \sigma). But this contains an integral over the individual parameters: p(A | \mu, \sigma) = \int d\theta p(A | \theta) p(\theta | \mu, \sigma). So we need (\hat{\mu}, \hat{\sigma}) = argmax_{\mu, \sigma} p(A | \mu, \sigma) = argmax_{\mu, \sigma} \int d\theta p(A | \theta) p(\theta | \mu, \sigma)
53 Inference. (\hat{\mu}, \hat{\sigma}) = argmax_{\mu, \sigma} \int d\theta p(A | \theta) p(\theta | \mu, \sigma). Options: analytical (rare); brute force (for simple problems); Expectation Maximisation (approximate, easy); variational Bayes; sampling/MCMC
54 Expectation Maximisation. log p(A | \mu, \sigma) = log \int d\theta p(A, \theta | \mu, \sigma) = log \int d\theta q(\theta) [ p(A, \theta | \mu, \sigma) / q(\theta) ] >= \int d\theta q(\theta) log [ p(A, \theta | \mu, \sigma) / q(\theta) ] by Jensen's inequality. kth E step: q^{(k+1)}(\theta) <- p(\theta | A, \mu^{(k)}, \sigma^{(k)}). kth M step: (\mu, \sigma)^{(k+1)} <- argmax \int d\theta q(\theta) log p(A, \theta | \mu, \sigma). Iterate between estimating MAP parameters given the prior parameters and estimating prior parameters from the MAP parameters
55 Bayesian Information Criterion. Laplace's approximation (saddle-point method): approximate the integrand by a Gaussian around its peak x_0, \int dx f(x) approx f(x_0) \sqrt{2\pi\sigma^2}, with \sigma^2 = -1 / (d^2/dx^2) log f(x) at x_0
56 EM with Laplace approximation. E step: we only need sufficient statistics to perform the M step, so approximate p(\theta | A, \mu^{(k)}, \sigma^{(k)}) approx N(m_k, S_k), i.e. q^{(k)}(\theta) = N(m_k, S_k), with m_k = argmax_\theta p(A | \theta) p(\theta | \mu^{(i)}, \sigma^{(i)}) and S_k^{-1} = -(\partial^2/\partial\theta^2) log [ p(A | \theta) p(\theta | \mu^{(i)}, \sigma^{(i)}) ] at \theta = m_k. In matlab: [m,l,,,S] = fminunc(...). This is just what we had before: MAP inference given some prior parameters
57 EM with Laplace approximation: updates. M step: prior mean = mean of the MAP estimates, \mu^{(i+1)} = (1/K) \sum_k m_k; the prior variance depends on the variance of the MAP estimates and on the inverse Hessians S_k: (\sigma^2)^{(i+1)} = (1/K) \sum_k [ (m_k)^2 + S_k ] - (\mu^{(i+1)})^2. This takes the uncertainty of the estimates into account. And now iterate until convergence
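The M-step updates are one line each. A sketch with invented numbers for four subjects and a single parameter (m_k and S_k would come from the per-subject MAP fits of the E step):

```python
import numpy as np

# Hypothetical E-step output: per-subject MAP means m_k and posterior variances S_k.
m = np.array([0.5, 0.8, 0.2, 0.9])      # MAP estimates
S = np.array([0.04, 0.09, 0.25, 0.01])  # posterior variances (estimate uncertainty)

mu = m.mean()                    # prior mean = mean of MAP estimates
var = np.mean(m**2 + S) - mu**2  # prior variance includes the S_k terms
print(mu, var)
```

Dropping the S_k term would recover the plain variance of the MAP estimates, which understates the group variance when individual fits are uncertain.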
58 Parameter recovery. [Figure: true versus inferred parameter correlations for the Rescorla-Wagner model (\beta, \epsilon) and a two-step model (\beta_1, \omega), under simulations with truly uncorrelated or truly correlated parameters, with and without a GLM; Nsj = 10, 20, 50, 100; estimators: EM-MAP, MAP0, ML.]
59 Correlations. [Figure: parameter correlations (\beta' and \alpha' for RW; \beta_1' and \omega' for the two-step model) under ML, under MAP, and via the inferred prior covariance.]
60 Are the parameters ok for correlations? Individual subject parameter estimates are NO LONGER INDEPENDENT! Changing the group changes the parameter estimates. Comparing parameters fitted under different priors is problematic; correlations and t-tests within the same prior are ok. [Figure: p values for RW and two-step models under ML, EM-MAP and MAP; Nsj = 10, 30, 100.]
61 GLM. So far: infer individual parameters, then apply standard tests. Alternative: view the effect as variation across the group and build the regressor into the prior, \mu_i = \mu_Group + \beta x_i, then infer \beta. Specific, hence potentially more powerful
62 GLM. [Figure: GLM estimates of a group-level regressor; Nsj = 10, 20, 50, 100 and Nsj = 10, 30, 100.]
63 Fitting: how to. Write your likelihood function (matlab examples attached with emfit.m); don't do 20 separate ML fits! Pass it into emfit.m (or the julia version). Validate: generate data with the fitted parameters, compare, have a look: does it look right? Re-fit: is it stable? Do model comparison. Then: look at the parameters, do correlations etc. Future: GLM, full random effects over models and parameters jointly? (Daniel Schad)
64 Hierarchical / random effects models. Advantages: accurate group-level mean and variance; outliers due to a weak likelihood are regularised, strong outliers are not; useful for model selection. Disadvantages: individual estimates depend on the other subjects' data A_{j != i}, so be careful when interpreting them as summary statistics; more involved, less transparent. Psychiatry: groups are often not well defined, covariates may be better. fMRI: shrinking the variance of ML estimates; are fixed effects better still?
65 How does it do? [Figure: log-likelihood (LL) across a family of nested models: separate-parameter variants (Sep), GoOnly variants, and \beta\alpha Q variants.]
66 Overfitting. [Figure: an overly flexible curve chasing the noise in Y versus X.]
67 Model comparison. A fit by itself is not meaningful. Generative test: qualitative. Comparisons: versus random, versus other models, to test specific hypotheses and isolate particular effects in a generative setting
68 Model comparison. Averaged over its parameter settings, how well does the model fit the data? p(A | M) = \int d\theta p(A | \theta) p(\theta | M). Model comparison via Bayes factors: BF = p(A | M_1) / p(A | M_2). Problem: the integral is rarely solvable; approximations: Laplace, sampling, variational, ...
69 Why integrals? The God Almighty test: a powerful model spreads its probability over many possible data sets, a weak model concentrates it on few. The integral can be approximated by sampling parameters from the prior: p(A | M) approx (1/N) [ p(A | \theta^{(1)}) + p(A | \theta^{(2)}) + ... ]. These two factors fight it out: model complexity versus model fit
70 Group-level BIC. log p(A | M) = log \int d\theta p(A | \theta) p(\theta | M) approx -(1/2) BIC_int = log \hat{p}(A | \hat{\theta}_ML) - (1/2) |M| log |A|. Very simple: 1) EM to estimate the group prior mean and variance (simply done using fminunc, which provides Hessians); 2) sample from the estimated priors; 3) average
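The penalised-likelihood arithmetic is simple enough to sketch. The numbers below are hypothetical (two models, a modest likelihood gain for the larger one), and the function just restates the BIC formula above in the conventional -2 log-likelihood scale:

```python
import numpy as np

# BIC = -2 log p(A | theta_hat) + |M| log |A|, where |M| counts the (prior)
# parameters and |A| the number of data points; lower is better.
def bic_int(log_lik_hat, n_prior_params, n_datapoints):
    return -2.0 * log_lik_hat + n_prior_params * np.log(n_datapoints)

# Hypothetical comparison: model 2 fits slightly better but has more parameters.
b1 = bic_int(-1000.0, 4, 2000)   # e.g. 2 prior means + 2 prior variances
b2 = bic_int(-998.0, 8, 2000)
print(b1, b2)  # here the extra parameters do not pay for the small fit gain
```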
71 How does it do? [Figure: model comparison over the same model family, summed individual BIC_i versus the integrated group-level likelihood (LL Int).] Fitted by EM... too nice?
72 Group model selection: integrate out your parameters
73 Model comparison: overfitting? [Figure: BIC_int for RW; RW + noise; RW + noise + Q0; RW + noise + bias; RW (rew/pun) + noise + bias; RW + noise + bias + Pav; note: same number of parameters. Probability(Go) learning curves for Go to win, Go to avoid, Nogo to win, Nogo to avoid.]
74 Behavioural data modelling. Models are no panacea: they are statistics about specific aspects of the decision machinery and only account for part of the variance. The model needs to match the experiment: ensure subjects actually do the task the way you wrote it in the model; use model comparison. Model = quantitative hypothesis: a strong test; compare models, not parameters; a model includes all consequences of a hypothesis for choice
75 Thanks. Peter Dayan, Daniel Schad, Nathaniel Daw. SNSF, DFG
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationThe connection of dropout and Bayesian statistics
The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationReinforcement Learning. Summer 2017 Defining MDPs, Planning
Reinforcement Learning Summer 2017 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationComputer Vision Group Prof. Daniel Cremers. 11. Sampling Methods
Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationChapter 6: Temporal Difference Learning
Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods R. S.
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More information15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted
15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient
More informationReinforcement Learning: the basics
Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error
More informationFood delivered. Food obtained S 3
Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the
More informationVariational Autoencoders
Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationThe Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo
CSE 190: Reinforcement Learning: An Introduction Chapter 7: Eligibility races Acknowledgment: A good number of these slides are cribbed from Rich Sutton he Book: Where we are and where we re going Part
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationMachine Learning I Continuous Reinforcement Learning
Machine Learning I Continuous Reinforcement Learning Thomas Rückstieß Technische Universität München January 7/8, 2010 RL Problem Statement (reminder) state s t+1 ENVIRONMENT reward r t+1 new step r t
More informationIntroduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming
Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationComplexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
More informationIntroduction to Bayesian Learning. Machine Learning Fall 2018
Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationMODULE -4 BAYEIAN LEARNING
MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationProbabilistic Planning. George Konidaris
Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationReinforcement Learning. Value Function Updates
Reinforcement Learning Value Function Updates Manfred Huber 2014 1 Value Function Updates Different methods for updating the value function Dynamic programming Simple Monte Carlo Temporal differencing
More informationCSC321 Lecture 22: Q-Learning
CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize
More informationLecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010
Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book
More information