Reinforcement learning crash course

1 Reinforcement learning crash course. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich.

2 Overview: a rough overview of reinforcement learning, mainly following Sutton & Barto 1998; dopamine, prediction errors and more; fitting behaviour with RL models, including hierarchical approaches.

3 Setup. The environment sends states s_t and rewards r_t to the agent, which sends back actions a_t. The agent's goal is to choose the action sequence that maximises the total reward: {a_t} = argmax_{a_t} Σ_{t=1}^∞ r_t. After Sutton and Barto 1998.

4 State space: gold +1, electric shocks -1.

5 A Markov Decision Problem. States s_t ∈ S, actions a_t ∈ A, transition probabilities T^a_{ss'} = p(s_{t+1} | s_t, a_t), rewards r_t ~ R(s_{t+1}, a_t, s_t), and a policy π(a|s) = p(a|s).

6 A Markov Decision Problem (definitions repeated from the previous slide).

7 Actions. Two actions, left and right, plus an absorbing state; each action has its own transition matrix, T^left and T^right (matrices not reproduced here). Transitions are noisy: plants, environments and the agent itself are stochastic. With an absorbing state, the transition matrix has maximum eigenvalue < 1.

8 Markovian dynamics. The next state depends only on the current state and action: p(s_{t+1} | a_t, s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ...) = p(s_{t+1} | a_t, s_t). If this fails, augment the state: for a moving object, position alone is not Markovian, but s = [position, velocity] is.

9 A Markov Decision Problem (definitions repeated).

10 A Markov Decision Problem (definitions repeated).

11 Tall orders. Aim: maximise the total future reward Σ_{t=1}^∞ r_t, i.e. we have to sum over paths through the future and weigh each by its probability. The best policy achieves the best long-term reward.

12 Exhaustive tree search. With branching factor (width) w and depth d, the tree contains on the order of w^d paths.

13 Decision tree. (Figure: a decision tree over future states; the quantity to be maximised is Σ_{t=1}^∞ r_t.)

14 Policy for this talk: pose the problem mathematically; policy evaluation; policy iteration; Monte Carlo techniques (experience samples); TD learning. (Figure: the evaluate/update loop acting on the policy.)

15 Evaluating a policy. Aim: maximise the total future reward Σ_{t=1}^∞ r_t. To know which policy is best, evaluate it first. The policy determines the expected reward from each state: V^π(s_1) = E[ Σ_{t=1}^∞ r_t | s_1, a_t ~ π ].

16 Discounting. Given a policy, each state has an expected value V^π(s_1) = E[ Σ_{t=1}^∞ r_t | s_1, a_t ~ π ]. But the sum must be kept finite. Episodic tasks: Σ_{t=0}^T r_t < ∞. Discounted infinite horizons: Σ_{t=0}^∞ γ^t r_t < ∞, which corresponds to finite, exponentially distributed horizons, since γ^t ∝ e^{-t/τ}.

17 Markov Decision Problems.
V^π(s_t) = E[ Σ_{t'=1}^∞ r_{t'} | s_t = s, π ]
         = E[ r_1 | s_t = s, π ] + E[ Σ_{t'=2}^∞ r_{t'} | s_t = s, π ]
         = E[ r_1 | s_t = s, π ] + E[ V^π(s_{t+1}) | s_t = s, π ]
This dynamic consistency is key to many solution approaches. It states that the value of a state s is related to the values of its successor states s'.

18 Markov Decision Problems.
V^π(s_t) = E[ r_1 | s_t = s, π ] + E[ V^π(s_{t+1}) | s_t = s, π ], with r_1 ~ R(s_2, a_1, s_1). The first term expands as
E[ r_1 | s_t = s, π ] = E[ Σ_{s_{t+1}} p(s_{t+1} | s_t, a_t) R(s_{t+1}, a_t, s_t) ]
                      = Σ_{a_t} p(a_t | s_t) Σ_{s_{t+1}} p(s_{t+1} | s_t, a_t) R(s_{t+1}, a_t, s_t)
                      = Σ_{a_t} π(a_t | s_t) Σ_{s_{t+1}} T^{a_t}_{s_t s_{t+1}} R(s_{t+1}, a_t, s_t)

19 Bellman equation.
V^π(s_t) = E[ r_1 | s_t = s, π ] + E[ V^π(s_{t+1}) | s_t = s, π ], with
E[ r_1 | s_t, π ] = Σ_a π(a | s_t) Σ_{s_{t+1}} T^a_{s_t s_{t+1}} R(s_{t+1}, a, s_t)
E[ V^π(s_{t+1}) | s_t, π ] = Σ_a π(a | s_t) Σ_{s_{t+1}} T^a_{s_t s_{t+1}} V^π(s_{t+1})
so that
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ]

20 Bellman Equation. All future reward from state s = E[ immediate reward + all future reward from the next state s' ]:
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ]

21 Q values = state-action values. In V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ], the inner sum defines the state-action value:
Q^π(s, a) = Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ] = E[ Σ_{t=1}^∞ r_t | s, a, π ]
and state values are policy-averaged state-action values:
V^π(s) = Σ_a π(a|s) Q^π(s, a)

22 Bellman Equation. V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ]. To evaluate a policy, we need to solve this equation, i.e. find the self-consistent state values. Options for policy evaluation: exhaustive tree search (outwards, inwards, depth-first); value iteration (iterative updates); linear solution in one step; experience sampling.

23 Solving the Bellman Equation. Option 1: turn it into an update equation, V_{k+1}(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V_k(s') ]. Option 2: linear solution (with absorbing states). Writing the Bellman equation in vector form, v = R + T^π v, so v = (I - T^π)^{-1} R, at a cost of O(|S|³).
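To make the two options concrete, here is a minimal NumPy sketch for a small made-up MDP under a fixed random policy. The transition tensor T, rewards R, state and action counts are placeholders, and a discount gamma < 1 is used so that the linear solve works without needing absorbing states; none of these numbers come from the slides.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. T[a, s, s'] = p(s' | s, a); R[a, s, s'] = reward.
# All numbers are arbitrary placeholders for illustration.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(nS), size=(nA, nS))     # each T[a, s, :] sums to 1
R = rng.normal(size=(nA, nS, nS))
pi = np.full((nS, nA), 1.0 / nA)                  # uniform random policy

# Policy-conditioned transition matrix and expected immediate reward:
# T_pi[s, s'] = sum_a pi(a|s) T[a, s, s'];  R_pi[s] = sum_a pi(a|s) sum_s' T[a,s,s'] R[a,s,s']
T_pi = np.einsum('sa,asp->sp', pi, T)
R_pi = np.einsum('sa,asp,asp->s', pi, T, R)

# Option 1: iterative updates V_{k+1}(s) = R_pi(s) + gamma * sum_s' T_pi(s,s') V_k(s')
V = np.zeros(nS)
for _ in range(1000):
    V_new = R_pi + gamma * T_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Option 2: linear solution v = (I - gamma*T_pi)^{-1} R_pi, cost O(|S|^3)
V_lin = np.linalg.solve(np.eye(nS) - gamma * T_pi, R_pi)
print(np.allclose(V, V_lin, atol=1e-6))           # the two solutions agree
```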

24 Policy update. Given the value function for a policy, say via the linear solution, V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ], where the inner sum is Q^π(s, a). Given the values V^π for the policy π, we can improve the policy by always choosing the best action: π'(a|s) = 1 if a = argmax_a Q^π(s, a), and 0 otherwise. This is guaranteed to improve the policy: for a deterministic policy, Q^π(s, π'(s)) = max_a Q^π(s, a) ≥ Q^π(s, π(s)) = V^π(s).

25 Policy iteration and value iteration. Policy iteration alternates policy evaluation, v = (I - T^π)^{-1} R, with greedy policy improvement, π'(a|s) = 1 if a = argmax_a Σ_{s'} T^a_{ss'} [ R^a_{ss'} + V^π(s') ], and 0 otherwise. Value iteration instead updates V(s) = max_a Σ_{s'} T^a_{ss'} [ R^a_{ss'} + V(s') ] directly.
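A corresponding sketch of policy iteration and value iteration on the same kind of toy MDP (again with made-up T, R and a discount gamma, not taken from the slides); both should recover the same optimal values.

```python
import numpy as np

# Toy MDP with placeholder numbers: T[a, s, s'], R[a, s, s'], discount gamma.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.normal(size=(nA, nS, nS))

def q_from_v(V):
    # Q(s,a) = sum_s' T[a,s,s'] (R[a,s,s'] + gamma V(s'))
    return np.einsum('asp,asp->sa', T, R + gamma * V[None, None, :])

# --- Policy iteration: alternate exact evaluation and greedy improvement ---
pi = np.zeros(nS, dtype=int)                   # deterministic policy: one action per state
for _ in range(100):
    T_pi = T[pi, np.arange(nS), :]             # transitions under the current policy
    R_pi = np.einsum('sp,sp->s', T_pi, R[pi, np.arange(nS), :])
    V = np.linalg.solve(np.eye(nS) - gamma * T_pi, R_pi)   # exact evaluation
    pi_new = q_from_v(V).argmax(axis=1)        # greedy improvement
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

# --- Value iteration: V(s) <- max_a sum_s' T[a,s,s'](R[a,s,s'] + gamma V(s')) ---
V_vi = np.zeros(nS)
for _ in range(1000):
    V_new = q_from_v(V_vi).max(axis=1)
    if np.max(np.abs(V_new - V_vi)) < 1e-10:
        break
    V_vi = V_new

print(pi, np.allclose(V, V_vi, atol=1e-6))     # both recover the optimal policy/values
```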

26 Model-free solutions. So far we have assumed knowledge of R and T. R and T are the model of the world, so we have assumed full knowledge of the dynamics and rewards in the environment. What if we don't know them? We can still learn from state-action-reward samples: we can learn R and T from them and use our estimates to solve as above, or alternatively we can directly estimate V or Q.

27 Solving the Bellman Equation. Option 3: sampling. V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ] is an expectation over the policy and the transitions, so we can just draw some samples from the policy and the transitions and average over them (more about this later): a = Σ_k f(x_k) p(x_k) ≈ â = (1/N) Σ_i f(x^(i)), with x^(i) ~ p(x).

28 Learning from samples A new problem: exploration versus exploitation

29 Monte Carlo. First-visit MC: randomly start in all states, generate paths, and average the returns for the starting state only, V(s) = (1/N) Σ_i Σ_{t'=1}^T r^i_{t'} given s_0 = s. More efficient use of samples: every-visit MC; bootstrapping (TD); Dyna; better samples; on-policy versus off-policy; stochastic search, UCT, ...
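A first-visit Monte Carlo sketch on a toy random-walk chain (an assumed example, not the task in the slides): episodes are generated under a random policy with random starting states, and the returns following the first visit to each state are averaged.

```python
import numpy as np

# First-visit Monte Carlo evaluation on a toy episodic chain: states 0..6,
# 0 and 6 terminal, reward +1 only on reaching state 6, random-walk policy.
rng = np.random.default_rng(1)
N_EPISODES, GAMMA = 5000, 1.0
returns = {s: [] for s in range(1, 6)}

for _ in range(N_EPISODES):
    s = rng.integers(1, 6)                    # random start in a non-terminal state
    traj, rewards = [], []
    while s not in (0, 6):
        traj.append(s)
        s = s + rng.choice([-1, 1])           # random-walk policy
        rewards.append(1.0 if s == 6 else 0.0)
    # accumulate the return following the *first* visit to each state
    first = {}
    for t, st in enumerate(traj):
        if st not in first:
            first[st] = t
    G = 0.0
    for t in reversed(range(len(traj))):
        G = rewards[t] + GAMMA * G
        if first[traj[t]] == t:
            returns[traj[t]].append(G)

V = {s: np.mean(g) for s, g in returns.items()}
print(V)   # should be close to the true values 1/6, 2/6, ..., 5/6
```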

30 Update equation: towards TD. Bellman equation: V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V^π(s') ]. Before convergence it does not hold, leaving a residual dV(s) = -V(s) + Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ], which we can use to update the estimate: V_{i+1}(s) = V_i(s) + dV(s).

31 TD learning. dV(s) = -V(s) + Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. Sample instead of computing the expectations: a_t ~ π(a|s_t), s_{t+1} ~ T^{a_t}_{s_t,·}, r_t = R(s_{t+1}, a_t, s_t), giving the prediction error δ_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1}). Instead of the batch update V_{i+1}(s) = V_i(s) + dV(s), update the visited state: V_t(s_t) = V_{t-1}(s_t) + α δ_t.

32 TD learning. Sample a_t ~ π(a|s_t), s_{t+1} ~ T^{a_t}_{s_t,·}, r_t = R(s_{t+1}, a_t, s_t); prediction error δ_t = -V_t(s_t) + r_t + V_t(s_{t+1}); update V_{t+1}(s_t) = V_t(s_t) + α δ_t.
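The same toy chain evaluated by TD(0): a minimal sketch of the sampled prediction-error update described above, with made-up values for the learning rate, discount and number of episodes.

```python
import numpy as np

# TD(0) on the toy random-walk chain (placeholder example): after each
# transition, update V(s_t) with the prediction error
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),   V(s_t) += alpha * delta_t.
rng = np.random.default_rng(2)
GAMMA, ALPHA, N_EPISODES = 1.0, 0.05, 5000
V = np.zeros(7)                                # states 0..6; 0 and 6 terminal (V = 0)

for _ in range(N_EPISODES):
    s = rng.integers(1, 6)
    while s not in (0, 6):
        s_next = s + rng.choice([-1, 1])       # sample from policy and transitions
        r = 1.0 if s_next == 6 else 0.0
        delta = r + GAMMA * V[s_next] - V[s]   # TD prediction error
        V[s] += ALPHA * delta                  # bootstrap update
        s = s_next

print(V[1:6])   # should approach 1/6, 2/6, ..., 5/6
```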

33 SARSA. Do TD for state-action values instead: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ], using the tuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}). It has convergence guarantees and will estimate Q^π(s, a).

34 Q learning: off-policy. Learn off-policy: draw actions from some policy, requiring only extensive sampling, and update towards the optimum: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]. This will estimate Q*(s, a).
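A sketch contrasting the SARSA and Q-learning updates on a toy chain task with two actions; the environment, the epsilon-greedy exploration, the learning rate and the discount are all assumptions for illustration, not taken from the slides.

```python
import numpy as np

# SARSA (on-policy) and Q-learning (off-policy) on a toy chain: states 0..6,
# actions 0=left, 1=right, reward +1 for reaching state 6, terminal at 0 and 6.
rng = np.random.default_rng(3)
GAMMA, ALPHA, EPS, N_EPISODES = 0.9, 0.1, 0.1, 2000

def step(s, a):
    s_next = s - 1 if a == 0 else s + 1
    r = 1.0 if s_next == 6 else 0.0
    return s_next, r, s_next in (0, 6)

def eps_greedy(Q, s):
    return rng.integers(2) if rng.random() < EPS else int(Q[s].argmax())

Q_sarsa, Q_qlearn = np.zeros((7, 2)), np.zeros((7, 2))

for _ in range(N_EPISODES):
    # --- SARSA: update towards the action actually taken next ---
    s = rng.integers(1, 6); a = eps_greedy(Q_sarsa, s); done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = eps_greedy(Q_sarsa, s2)
        target = r if done else r + GAMMA * Q_sarsa[s2, a2]
        Q_sarsa[s, a] += ALPHA * (target - Q_sarsa[s, a])
        s, a = s2, a2
    # --- Q-learning: update towards the greedy (max) action ---
    s = rng.integers(1, 6); done = False
    while not done:
        a = eps_greedy(Q_qlearn, s)
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * Q_qlearn[s2].max()
        Q_qlearn[s, a] += ALPHA * (target - Q_qlearn[s, a])
        s = s2

print(Q_qlearn[1:6].argmax(axis=1))   # learned greedy policy: always go right
```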

35 The effect of bootstrapping. Example (after Sutton and Barto 1998): episodes B,1 observed six times; B,0 once; and A,0 followed by B,0 once. Monte Carlo (every visit) gives V(B) = 3/4 and V(A) = 0; TD gives V(B) = 3/4 and V(A) ≈ 3/4. Averaging over various degrees of bootstrapping gives TD(λ).

36 Conclusion. Long-term rewards have internal consistency, and this can be exploited for solution. Exploration and exploitation trade off when sampling. Clever use of samples can produce fast learning. The brain most likely does something like this.

37 Fitting models to behaviour. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich.

38 Example task. (Figure: a Go/NoGo learning task with four conditions, Go to win, Go to avoid losing, NoGo to win, NoGo to avoid losing, showing the probability correct and the probability of a Go response in each condition.) Think of it as four separate two-armed bandit tasks. Guitart-Masip, Huys et al. 2012.

39 Analysing behaviour. Standard approach: decide which feature of the data you care about, then run descriptive statistical tests, e.g. an ANOVA on the probability correct in the four conditions. This has many strengths, but also weaknesses: it is piecemeal rather than holistic/global, descriptive rather than generative, and has no internal variables.

40 Models. Holistic: they aim to model the process by which the data came about in its entirety. Generative: they can be run on the task to generate data as if a subject had done the task. Inference process: they capture the inference process subjects have to make to perform the task, in sufficient detail to replicate the data. Parameters replace test statistics, and their meaning is explicit in the model.

41 Actions, Q values and probabilities. The learning process: Q_t(a_t, s_t) = Q_{t-1}(a_t, s_t) + ε (r_t - Q_{t-1}(a_t, s_t)). The link function: p(a_t | s_t, h_t, θ) = p(a_t | Q(a_t, s_t), θ), e.g. a softmax p(a_t | s_t) ∝ e^{β Q(a_t, s_t)}, i.e. p(a_t | s_t) = e^{β Q(a_t, s_t)} / Σ_{a'} e^{β Q(a', s_t)}, with 0 ≤ p(a) ≤ 1. The link function connects the learning process to the observations: choices, RTs, or any other data.
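A minimal simulation of this generative model, assuming a single two-armed bandit with made-up reward probabilities and placeholder parameter settings: a Rescorla-Wagner update with learning rate epsilon feeding a softmax link with inverse temperature beta.

```python
import numpy as np

# Simulate choices from a Rescorla-Wagner learner with a softmax link on a
# two-armed bandit (all settings are illustrative placeholders).
rng = np.random.default_rng(4)
T_TRIALS, EPS_LR, BETA = 200, 0.1, 3.0          # learning rate epsilon, inverse temperature beta
p_reward = np.array([0.8, 0.2])                 # true reward probability of each action

Q = np.zeros(2)
choices, rewards = np.zeros(T_TRIALS, int), np.zeros(T_TRIALS)
for t in range(T_TRIALS):
    p = np.exp(BETA * Q) / np.exp(BETA * Q).sum()   # softmax link: p(a) proportional to exp(beta*Q(a))
    a = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[a])
    Q[a] += EPS_LR * (r - Q[a])                     # RW update: Q <- Q + eps * (r - Q)
    choices[t], rewards[t] = a, r

print(choices.mean())    # fraction of trials choosing the worse action; should be well below 0.5
```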

42 Fitting models I. Maximum likelihood (ML) parameters: θ̂_ML = argmax_θ L(θ), where the (log) likelihood of all choices is
L(θ) = log p({a_t}_{t=1}^T | {s_t}_{t=1}^T, {r_t}_{t=1}^T, θ)
     = log p({a_t}_{t=1}^T | {Q(s_t, a_t; θ)}_{t=1}^T, θ)
     = log Π_{t=1}^T p(a_t | Q(s_t, a_t; θ), θ)
     = Σ_{t=1}^T log p(a_t | Q(s_t, a_t; θ), θ)
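A sketch of the corresponding maximum-likelihood fit in Python (the slides use matlab's fminunc/fmincon; scipy.optimize.minimize is used here instead): the negative log-likelihood of the choices is computed by replaying the RW model over the data, then minimised numerically. The data-generating snippet is a placeholder with assumed true parameters.

```python
import numpy as np
from scipy.optimize import minimize

# Maximum-likelihood fit of the RW + softmax model to one subject's choice and
# reward sequences (here generated by a toy two-armed bandit simulation).
def neg_log_lik(params, choices, rewards):
    eps_lr, beta = params
    Q, nll = np.zeros(2), 0.0
    for a, r in zip(choices, rewards):
        logp = beta * Q - np.log(np.exp(beta * Q).sum())   # log softmax
        nll -= logp[a]
        Q[a] += eps_lr * (r - Q[a])                        # RW update
    return nll

# Example usage with simulated data (assumed true eps = 0.1, beta = 3):
rng = np.random.default_rng(4)
Q, choices, rewards = np.zeros(2), [], []
for _ in range(200):
    p = np.exp(3.0 * Q) / np.exp(3.0 * Q).sum()
    a = rng.choice(2, p=p); r = float(rng.random() < [0.8, 0.2][a])
    Q[a] += 0.1 * (r - Q[a]); choices.append(a); rewards.append(r)

fit = minimize(neg_log_lik, x0=[0.3, 1.0], args=(choices, rewards),
               bounds=[(1e-3, 1.0), (1e-3, 20.0)])
print(fit.x)    # ML estimates of (learning rate, beta); noisy with only 200 trials
```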

43 Fitting models II. There is no closed form, so use your favourite method: gradients, fminunc / fmincon, ... Gradients for the RW model:
dL(θ)/dθ = Σ_t d/dθ log p(a_t | Q_t(a_t, s_t; θ), θ)
         = β Σ_t [ dQ_t(a_t, s_t; θ)/dθ - Σ_{a'} p(a' | Q_t(a', s_t; θ), θ) dQ_t(a', s_t; θ)/dθ ]
with the recursion dQ_t(a_t, s_t; θ)/dε = (1 - ε) dQ_{t-1}(a_t, s_t; θ)/dε + (r_t - Q_{t-1}(a_t, s_t; θ)).

44 Little tricks. Transform your variables: β = e^θ ⇔ θ = log(β); ε = 1/(1 + e^{-θ}) ⇔ θ = log(ε/(1 - ε)); then optimise d log L(θ)/dθ in the unconstrained space. Avoid over/underflow: with y(a) = β Q(a) and y_m = max_a y(a), p(a) = e^{y(a)} / Σ_b e^{y(b)} = e^{y(a) - y_m} / Σ_b e^{y(b) - y_m}.
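Both tricks as small sketches: exp and sigmoid transforms from unconstrained parameters, and a max-subtracted softmax.

```python
import numpy as np

# (1) optimise unconstrained theta and map to the constrained (beta, eps);
# (2) subtract the max before exponentiating to avoid overflow in the softmax.
def transform(theta):
    beta = np.exp(theta[0])                 # beta = e^theta, always > 0
    eps = 1.0 / (1.0 + np.exp(-theta[1]))   # eps = sigmoid(theta), in (0, 1)
    return beta, eps

def stable_softmax(q, beta):
    y = beta * q
    y = y - y.max()                         # shifting by max(y) leaves p unchanged
    return np.exp(y) / np.exp(y).sum()

print(transform(np.array([0.7, -2.2])))
print(stable_softmax(np.array([1000.0, 1001.0]), beta=1.0))   # no overflow
```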

45 ML characteristics. ML is asymptotically consistent, but the variance of the estimates is high. (Figure: estimates across runs in a simulated 10-armed bandit, 1 stimulus, 10 actions, learning rate ε = .05, β = 2, inferring β and ε.)

46 Priors. (Figure: "not so smooth" versus "smooth" prior probability densities over a parameter, and example parameter distributions (memory, switches) from Dombrovski et al. 2010.)

47 Maximum a posteriori estimate.
P(θ) = p(θ | a_{1...T}) = p(a_{1...T} | θ) p(θ) / ∫ dθ' p(a_{1...T} | θ') p(θ') ∝ p(a_{1...T} | θ) p(θ)
log P(θ) = Σ_{t=1}^T log p(a_t | θ) + log p(θ) + const.
d log P(θ)/dθ = d log L(θ)/dθ + d log p(θ)/dθ
If the likelihood is strong, the prior will have little effect; it mainly influences poorly constrained parameters. If a parameter is strongly constrained to lie outside the typical range of the prior, it will win over the prior.
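A sketch of the MAP objective: the negative log prior is added to the negative log likelihood and the sum is minimised. The Gaussian prior over transformed parameters, its variance, and the reuse of the earlier neg_log_lik function are all assumptions; the prior means follow the example values on the next slide.

```python
import numpy as np

# MAP estimation for one subject: minimise  -log p(A|theta) - log p(theta),
# with a diagonal Gaussian prior over transformed parameters
# theta = (log beta, logit eps). Prior variance is a placeholder.
PRIOR_MEAN = np.array([0.0, -3.0])     # e.g. mbeta = 0, meps = -3
PRIOR_VAR = np.array([4.0, 4.0])       # assumed prior variances

def neg_log_posterior(theta, neg_log_lik, choices, rewards):
    beta = np.exp(theta[0])
    eps = 1.0 / (1.0 + np.exp(-theta[1]))
    nll = neg_log_lik([eps, beta], choices, rewards)
    neg_log_prior = 0.5 * np.sum((theta - PRIOR_MEAN) ** 2 / PRIOR_VAR)
    return nll + neg_log_prior             # up to an additive constant

# Usage: scipy.optimize.minimize(neg_log_posterior, x0=np.zeros(2),
#                                args=(neg_log_lik, choices, rewards))
```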

48 Maximum a posteriori estimate. (Figure: MAP estimates across runs; 200 trials, 1 stimulus, 10 actions, learning rate ε = .05, β = 2; prior means m_β = 0, m_ε = -3, n = 1.)

49 But... what prior parameters should I use?

50 Hierarchical estimation (random effects). (Figure: graphical models relating group-level parameters μ, Σ, individual parameters θ_i, and data A_i.) A fixed effect conflates within- and between-subject variability. Modelling average behaviour disregards between-subject variability and needs an adapted model. The summary-statistic approach treats the parameters as random variables, one per subject, but overestimates the group variance because ML estimates are noisy. Random effects: the prior mean is the group mean, and p(A_i | μ, Σ) = ∫ dθ_i p(A_i | θ_i) p(θ_i | μ, Σ).

51 Random effects. See subjects as drawn from a group. Fixed effect with respect to the model: all subjects share the same model. Parametrically nested models, Q(a, s) = ω_1 Q_1(a, s) + ω_2 Q_2(a, s), assume a within-subject mixture rather than a group mixture of perfect types. Random effects in models: each subject's model is itself drawn from a group-level distribution. (Figure: the corresponding graphical models.)

52 Estimating the hyperparameters. Effectively we now want to do gradient ascent on p(A | μ, Σ) with respect to the group-level parameters. But this contains an integral over the individual parameters: p(A | μ, Σ) = ∫ dθ p(A | θ) p(θ | μ, Σ). So we need (μ̂, Σ̂) = argmax_{μ,Σ} p(A | μ, Σ) = argmax_{μ,Σ} ∫ dθ p(A | θ) p(θ | μ, Σ).

53 Inference. (μ̂, Σ̂) = argmax_{μ,Σ} ∫ dθ p(A | θ) p(θ | μ, Σ). Options: analytical (rare); brute force (for simple problems); Expectation Maximisation (approximate, easy); Variational Bayes; sampling / MCMC.

54 Expectation Maximisation.
log p(A | μ, Σ) = log ∫ dθ p(A, θ | μ, Σ) = log ∫ dθ q(θ) p(A, θ | μ, Σ) / q(θ) ≥ ∫ dθ q(θ) log [ p(A, θ | μ, Σ) / q(θ) ]   (Jensen's inequality)
k-th E step: q^(k+1)(θ) ← p(θ | A, (μ, Σ)^(k))
k-th M step: (μ, Σ)^(k+1) ← argmax_{μ,Σ} ∫ dθ q(θ) log p(A, θ | μ, Σ)
Iterate between estimating MAP parameters given the prior parameters and estimating prior parameters from the MAP parameters.

55 Bayesian Information Criterion. Laplace's approximation (saddle-point method): approximate the integrand by a Gaussian around its mode x_0, so that ∫ dx f(x) ≈ f(x_0) √(2π σ²), where σ² is the inverse curvature of -log f at x_0. (Figure: f and log f, which near the mode is just a Gaussian / parabola.)

56 EM with Laplace approximation. E step: we only need sufficient statistics to perform the M step, so approximate p(θ | A, (μ, Σ)^(k)) ≈ N(m_k, S_k); the E step q^(k+1)(θ) ← p(θ | A, (μ, Σ)^(k)) then becomes q_k(θ) = N(m_k, S_k), with
m_k ← argmax_θ p(A_k | θ) p(θ | μ^(i), Σ^(i))
S_k^(-1) ← -∂²/∂θ² log [ p(A_k | θ) p(θ | μ^(i), Σ^(i)) ] evaluated at θ = m_k
(in matlab, e.g. [m, l, ..., S] = fminunc(...) returns the mode and the Hessian). This is just what we had before: MAP inference given some prior parameters.

57 EM with Laplace approximation: updates. M step: the prior mean is the mean of the MAP estimates, μ^(i+1) = (1/K) Σ_k m_k, and the prior variance is (σ²)^(i+1) = (1/K) Σ_k [ m_k² + S_k ] - (μ^(i+1))². The prior variance thus depends on the inverse Hessians S_k and on the variance of the MAP estimates, taking the uncertainty of the estimates into account. Now iterate until convergence.
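A compact sketch of the whole EM loop with the Laplace approximation, assuming a diagonal Gaussian group prior and a user-supplied per-subject negative log-likelihood; the function names, the finite-difference Hessian and the numerical safeguards are illustrative choices, not the emfit.m implementation.

```python
import numpy as np
from scipy.optimize import minimize

# EM with Laplace approximation for hierarchical (random-effects) fitting.
# Assumptions: data is a list with one entry per subject, and nll(theta, data_k)
# is a negative log likelihood over unconstrained parameters theta (dimension D).
def numerical_hessian_diag(f, x, h=1e-4):
    # crude diagonal second derivatives by finite differences (for S_k)
    d = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        d[i] = (f(x + e) - 2 * f(x) + f(x - e)) / h**2
    return d

def em_fit(nll, data, D, n_iter=50):
    K = len(data)
    mu, sigma2 = np.zeros(D), np.ones(D) * 10.0     # broad initial group prior
    m = np.zeros((K, D))
    for _ in range(n_iter):
        S = np.zeros((K, D))
        for k in range(K):
            # E step (Laplace): MAP estimate m_k and curvature-based variance S_k
            obj = lambda th: nll(th, data[k]) + 0.5 * np.sum((th - mu)**2 / sigma2)
            m[k] = minimize(obj, m[k]).x
            S[k] = 1.0 / np.maximum(numerical_hessian_diag(obj, m[k]), 1e-6)
        # M step: prior mean = mean of MAP estimates;
        # prior variance = mean of (m_k^2 + S_k) minus the squared mean
        mu = m.mean(axis=0)
        sigma2 = np.maximum((m**2 + S).mean(axis=0) - mu**2, 1e-6)
    return mu, sigma2, m
```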

58 Parameter recovery. (Figure: correlations between true and inferred parameters, for a Rescorla-Wagner model (β, α) and a two-step model (β_1, ω), in simulations with truly uncorrelated and truly correlated parameters and no GLM, for group sizes Nsj = 10, 20, 50, 100, comparing EM-MAP, MAP0 and ML.)

59 Correlations. (Figure: correlations between ML estimates β' and α', between MAP estimates β'' and α'', and the corresponding MAP prior covariance, for the RW model; likewise for β_1 and ω in the two-step model.)

60 Are parameters OK for correlations? Individual subject parameter estimates are no longer independent: changing the group changes the parameter estimates. Be careful when comparing parameters estimated under different priors; correlations and t-tests within the same prior are OK. (Figure: p-values for the RW and two-step models under ML, EM-MAP and MAP estimation, for Nsj = 10, 30, 100.)

61 GLM. So far: infer individual parameters, then apply standard tests. Alternative: view the effect of interest as variation across the group and build it into the group-level model, e.g. μ_i = μ_Group + β x_i with x_i a subject-level regressor, and infer the regression weight directly. This is more specific, and potentially more powerful.

62 GLM. (Figure: recovery of the group-level regressor, GLM estimates against the true effect, for group sizes Nsj = 10 to 100.)

63 Fitting: how to. Write your likelihood function (matlab examples are attached with emfit.m); don't just do 20 separate ML fits. Pass it into emfit.m, or the julia version. Validate: generate data with the fitted parameters, compare, and have a look; does it look right? Re-fit: is it stable (see the sketch below)? Then do model comparison, and only then look at parameters, correlations etc. Future: GLMs and full random effects over models and parameters jointly? (Daniel Schad.) (Figure: example distributions of reward sensitivity ρ and learning rate ε with p-values.)
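The validation step sketched as code: simulate from the fitted parameters, re-fit, and correlate true with recovered values. The simulate and fit callables are stand-ins for your own model code, not functions provided with the slides.

```python
import numpy as np

# Parameter recovery / stability check: generate synthetic data from the fitted
# parameters, re-fit, and check that the parameters come back.
def recovery_check(fitted_params, simulate, fit, task, n_rep=20):
    true, recovered = [], []
    for theta in fitted_params:              # one parameter vector per subject
        for _ in range(n_rep):
            data = simulate(theta, task)     # generate data as if a subject had done the task
            true.append(theta)
            recovered.append(fit(data))      # re-fit the model to the synthetic data
    true, recovered = np.array(true), np.array(recovered)
    # correlation between true and recovered values, one per parameter
    return [np.corrcoef(true[:, j], recovered[:, j])[0, 1] for j in range(true.shape[1])]
```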

64 Hierarchical / random effects models. Advantages: accurate group-level mean and variance; outliers due to a weak likelihood are regularised, while strong outliers are not; useful for model selection. Disadvantages: individual estimates depend on the other subjects' data A_{j≠i} and on the group parameters, so be careful interpreting them as summary statistics; more involved and less transparent. In psychiatry, groups are often not well defined, so covariates may be better. For fMRI: shrink the variance of the ML estimates; are fixed effects better still?

65 How does it do? (Figure: summed log-likelihood LL for a set of candidate models: βα, βα2, βα4, 2βα2, βαQ and their GoOnly and Sep variants.)

66 Overfitting. (Figure: a flexible model fit to data points, Y against X, illustrating overfitting.)

67 Model comparison. A fit by itself is not meaningful. A generative test is qualitative. Comparisons, against a random (null) model and against other models, test specific hypotheses and isolate particular effects in a generative setting.

68 Model comparison. Averaged over its parameter settings, how well does the model fit the data? p(A | M) = ∫ dθ p(A | θ) p(θ | M). Model comparison via Bayes factors: BF = p(A | M_1) / p(A | M_2). Problem: the integral is rarely solvable; approximations include Laplace, sampling, variational methods, ...

69 Why integrals? The God Almighty test: average the likelihood over parameter samples, p(A | M) ≈ (1/N) ( p(A | θ^(1)) + p(A | θ^(2)) + ... ). (Figure: a powerful and a weak model compared.) Two factors fight it out: model complexity versus model fit.

70 Group-level BIC.
log p(A | M) = log ∫ dψ p(A | ψ) p(ψ | M) ≈ -(1/2) BIC_int = log p̂(A | ψ̂_ML) - (1/2) |M| log(|A|)
where ψ are the group-level (prior) parameters, |M| is their number and |A| is the number of data points. Very simple: 1) use EM to estimate the group prior mean and variance (easily done with fminunc, which provides Hessians); 2) sample from the estimated priors; 3) average.
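A sketch of steps 2 and 3, assuming the diagonal Gaussian prior and the per-subject likelihood from the EM sketch above: sample parameters from the fitted prior, average the likelihood over the samples in log space, and plug the result into BIC_int.

```python
import numpy as np
from scipy.special import logsumexp

# Approximate log p(A_k | prior) for one subject by averaging the likelihood
# over samples from the fitted group prior N(mu, diag(sigma2)).
# nll(theta, data_k) is the per-subject negative log likelihood (an assumption).
def log_marginal_per_subject(nll, data_k, mu, sigma2, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    thetas = mu + np.sqrt(sigma2) * rng.standard_normal((n_samples, len(mu)))
    loglik = np.array([-nll(th, data_k) for th in thetas])
    # log( (1/N) sum_i p(A_k | theta_i) ), computed stably in log space
    return logsumexp(loglik) - np.log(n_samples)

# BIC_int then penalises the summed log marginal likelihood with the number of
# group-level parameters |M| and the total number of data points |A|:
#   BIC_int = -2 * sum_k log p(A_k | prior) + |M| * log(|A|)
```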

71 How does it do? (Figure: the candidate models ranked by summed log-likelihood, by Σ_i BIC_i, and by the integrated BIC_int / LL.) Fitted by EM... too nice?

72 Group model selection: integrate out your parameters.

73 Model comparison: overfitting? (Figure: BIC_int for a sequence of models: RW; RW + noise; RW + noise + Q0; RW + noise + bias; RW (rew/pun) + noise + bias; RW + noise + bias + Pav; together with the fitted probability of Go responses in the four conditions: Go to win, Go to avoid, NoGo to win, NoGo to avoid. Note: same number of parameters.)

74 Behavioural data modelling. Models are no panacea: they are statistics about specific aspects of the decision machinery and only account for part of the variance. The model needs to match the experiment: ensure subjects actually do the task the way you wrote it in the model, and use model comparison. A model is a quantitative hypothesis, and therefore a strong test: you need to compare models, not parameters, because a model includes all consequences of a hypothesis for choice.

75 Thanks: Peter Dayan, Daniel Schad, Nathaniel Daw; SNSF, DFG.
