Constructing Learning Models from Data: The Dynamic Catalog Mailing Problem


1 Constructing Learning Models from Data: The Dynamic Catalog Mailing Problem. Peng Sun, May 6, 2003.

2 Problem and Motivation. Big industry: in 2000, catalog companies in the USA sent out 17 billion catalogs and generated billions of dollars in spending. Important decision: whom to mail catalogs to (Hayes 1992; Gönül and Shi 1998). Difficult problem: the decision is inherently dynamic. Challenge: constructing a dynamic programming model from off-policy sample trajectories.

3 Problem and Motivation. Contributions: a DP approach for finding dynamic catalog mailing decisions, yielding profit improvements and intuition about the optimal policy; and a treatment of the general problems that arise when constructing dynamic programming / reinforcement learning models from historical data.

4 Contents. The catalog mailing problem: background; model; computational results. Dynamic programming models from data: the endogeneity problem (attribution errors; fixed points in batch online learning); effects of model inaccuracy. Field test.

5 The Catalog Mailing Problem: Background. Prospective customers versus house customers. The RFM model (Bult and Wansbeek 1995): Recency, Frequency, Monetary value; likelihood of response used as the mailing criterion; standard in industry and academic research. Dynamic programming based on RFM: Bitran and Mondschein (1996), Gönül and Shi (1998).

6 The Catalog Mailing Problem: Model Overview. The problem and objective: learn a near-optimal mailing policy directly from data. Available data: transaction history and mailing history.

7 The Catalog Mailing Problem: Model Overview. [Figure: a customer timeline showing purchases and mailings over time.]

8 The Catalog Mailing Problem: Model Overview. The model is a discounted, infinite-horizon DP. Finite state space S encoding each customer's historical information. Decision: mail or do not mail, A = {0, 1}. Mailing policy π : S → A. Profit-to-go (lifetime value) V^π(s) := E[ Σ_{t=0}^∞ α^t g_t | π, s_0 = s ]. Objective: π* = arg max_π V^π. Transition probabilities P^π and rewards g^π are estimated from data.

9 The Catalog Mailing Problem: Model Overview. Data preprocessing: compute n variable values for each customer at each time period. State space construction: build the mapping H and the approximate value function Ṽ with a binary tree. Solving the DP: estimate P and g from the data, then run the policy iteration algorithm.
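To make the "Solving DP" step concrete, here is a minimal sketch (not the implementation used in the talk) of policy iteration for the discounted model above, assuming the preprocessing and state-space construction have already produced empirical estimates P[a] and g[a] for each action.

```python
import numpy as np

def policy_iteration(P, g, alpha, max_iter=100):
    """Policy iteration for a discounted MDP.

    P[a] is an |S| x |S| transition matrix and g[a] an |S|-vector of rewards,
    both estimated from data; alpha is the monthly discount factor.
    Returns a deterministic policy (one action per state) and its profit-to-go V.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)          # start with "do not mail" everywhere
    for _ in range(max_iter):
        # Policy evaluation: solve (I - alpha * P_pi) V = g_pi
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        g_pi = np.array([g[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - alpha * P_pi, g_pi)
        # Policy improvement: greedy one-step lookahead
        Q = np.stack([g[a] + alpha * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```

With A = {0, 1}, P[1] and g[1] would be estimated from the customer-period observations in which a catalog was mailed, and P[0] and g[0] from the remaining observations.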

10 The Catalog Mailing Problem: Variables. Transaction history: recency, frequency, monetary value; purchase stocks; time since becoming a customer. Mailing history: mailing stocks. Seasonality: time of the year.
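The slide does not give exact definitions of the "stock" variables, so the sketch below assumes they are exponentially decayed counts of past purchases and mailings; the function name and decay parameter are illustrative only.

```python
import numpy as np

def customer_variables(purchase_amounts, mailings, decay=0.9):
    """Build per-period state variables for one customer.

    purchase_amounts[t] is the dollar amount purchased in period t (0 if none),
    mailings[t] is 1 if a catalog was mailed in period t.  The "stock" variables
    are assumed here to be exponentially decayed counts; the slide does not
    specify the exact definition.
    """
    T = len(purchase_amounts)
    rows, last_purchase = [], None
    purchase_stock, mailing_stock, frequency, monetary = 0.0, 0.0, 0, 0.0
    for t in range(T):
        recency = (t - last_purchase) if last_purchase is not None else t + 1
        season = t % 12                      # time of the year, assuming monthly periods
        # Record the state observed at the start of period t (history up to t-1).
        rows.append([recency, frequency, monetary, purchase_stock,
                     mailing_stock, t, season])
        if purchase_amounts[t] > 0:
            last_purchase, frequency = t, frequency + 1
            monetary += purchase_amounts[t]
        purchase_stock = decay * purchase_stock + (purchase_amounts[t] > 0)
        mailing_stock = decay * mailing_stock + mailings[t]
    return np.array(rows, dtype=float)
```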

11 The Catalog Mailing Problem: State Space Construction. Linear cuts organized by a binary tree structure. Criteria for segments: observations in a segment should be neighbors in R^n and have similar profitability. This yields the mapping H : R^n → S and the approximate value function Ṽ^π(H(x)) for x ∈ R^n.

12 The Catalog Mailing Problem: State Space Construction. [Figure: the binary tree of cuts partitioning the customer-variable space into segments.]
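A rough sketch of the segment-construction idea, under the assumption that each cut is an axis-aligned threshold on a single variable chosen to separate observations with dissimilar profitability (the slides describe general linear cuts; this simplification and all names below are illustrative):

```python
import numpy as np

def build_tree(X, profit, max_leaves=500, min_size=5000):
    """Greedy binary segmentation of customer-period vectors X (n x d).

    Each leaf becomes one DP state.  Splits are axis-aligned cuts chosen to
    separate observations with dissimilar profitability, a simplification of
    the linear cuts on the slide.  Returns a state index for every row of X.
    """
    leaves = [np.arange(len(X))]              # start with one segment holding everyone
    while len(leaves) < max_leaves:
        best = None
        for li, idx in enumerate(leaves):
            if len(idx) < 2 * min_size:
                continue
            for d in range(X.shape[1]):
                cut = np.median(X[idx, d])
                left = idx[X[idx, d] <= cut]
                right = idx[X[idx, d] > cut]
                if len(left) < min_size or len(right) < min_size:
                    continue
                gap = abs(profit[left].mean() - profit[right].mean())
                if best is None or gap > best[0]:
                    best = (gap, li, left, right)
        if best is None:
            break
        _, li, left, right = best
        leaves[li:li + 1] = [left, right]     # replace the segment by its two children
    state = np.empty(len(X), dtype=int)
    for s, idx in enumerate(leaves):
        state[idx] = s
    return state
```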

13 The Catalog Mailing Problem: Data. A large catalog mailing retailer selling multiple product categories; 1.8 million customers in the clothing section; transaction and mailing history over the past 6 years. We construct our model using 100,000 customers, giving over 5 million observations (an observation is one customer in one time period).

14 The Catalog Mailing Problem. Computational Results: Different Discount Rates. Table 1 reports average profit-to-go estimates and mailing rates by monthly discount rate (continued on the next slide).

15 The Catalog Mailing Problem. Computational Results: Different Discount Rates (continued).
Table 1: Average Profit-to-Go Estimates and Mailing Rates by Discount Rate
Discount rate (1-α) | Normalized avg. V, historical | Normalized avg. V, optimal | Mailing rate, historical | Mailing rate, optimal
15%   | $.64   | --  | 58% | 3%
10%   | $8.45  | --  | 58% | 43%
5%    | $37.39 | --  | --  | 62%
3%    | $59.75 | --  | --  | 7%
0.87% | $59.7  | --  | --  | 78%

16 The Catalog Mailing Problem. Computational Results: Long-Term Profit Flow. [Figure: undiscounted profit flow; profit ($) versus time period; series include the historical policy.]

17 The Catalog Mailing Problem. Computational Results: Changing the Number of States.
Table 2: Average Profit-to-Go Estimates by Discount Rate and Number of States (optimal policy)
Discount rate (1-α) | 500 states | 1,000 states | 2,000 states
15% | $3.52  | $4.02  | $4.52
5%  | $48.23 | $49.88 | $51.42
3%  | $86.69 | $90.00 | $92.86

18 The Catalog Mailing Problem. Computational Results: Mailing Policy. [Figure: current (historical) mailing rate by purchase recency (# of months) and mailing stock.]

19 The Catalog Mailing Problem. Computational Results: Mailing Policy (continued). [Figure: optimal mailing rate by purchase recency (# of months) and mailing stock.]

20 The Catalog Mailing Problem. Computational Results: Profit-to-Go. [Figure: profit-to-go ($) under the current policy by purchase recency (# of months) and mailing stock.]

21 The Catalog Mailing Problem. Computational Results: Profit-to-Go (continued). [Figure: profit-to-go ($) under the optimal policy by purchase recency (# of months) and mailing stock.]

22 Constructing a DP Model from Data. The underlying model is estimated directly from historical data. Endogeneity problem: the computed optimal policy depends on the current (historical) policy, a dependence caused by hidden state information. Attribution error: the historical policy π_H acted on hidden state information that the model does not capture. Batch online learning: iteratively update the policy and re-estimate the model parameters. Self-enforcing policies: the data collected under a policy validates that policy's optimality. Random noise: randomness in the historical data may bias the DP results.

23 Constructing a DP Model from Data. Endogeneity Problem: Attribution Error. Hidden state information affects the historical policy: mailed and not-mailed customers differ along an unobservable dimension. The DP algorithm attributes the effect of the hidden information to the actions themselves: since "mail" ("not mail") looked good, it concludes you should mail (not mail) everyone in the state. The result is an upward-biased profit-to-go estimate. We mitigate these effects in the computation.

24 Constructing a DP Model from Data. Endogeneity Problem: Theoretical Justification. [Figure: an original Markov chain with distinct states i and j and neighboring states a, b, c, d, with steady-state probabilities p and profit-to-go V, shown next to the aggregated chain in which i and j are merged into a single state k, with its own steady-state probabilities and profit-to-go Ṽ.]

25 Constructing a DP Model from Data. Endogeneity Problem: Theoretical Justification (continued). Proposition 1: p^T V = p̃^T Ṽ. Suppose that, under the historical policy, the actions taken at i and j were different. Proposition 2: either Ṽ^[i] ≥ Ṽ or Ṽ^[j] ≥ Ṽ (componentwise); if P^[i] Ṽ ≠ P^[j] Ṽ, then the corresponding inequality is strict in some components.

26 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points. Aggregated state space: the state space summarizes historical information. Batch online learning: iterate between estimating the model parameters and updating the policy used to collect data. Self-enforcing policies: does this batch online learning procedure converge, and if so, to what?

27 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Self-enforcing policies: data collected according to a policy confirms that policy's optimality, so batch online learning stops there. Do such self-enforcing policies exist? If they do, what are their properties?

28 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: m states aggregated into one state. Hidden state space: m states and n actions. Nature knows (in the hidden state space) the transition probability matrices P_a and immediate reward vectors g_a. A (randomized) policy is a vector λ ∈ [0, 1]^n with λ^T e = 1. What we observe: P_λ := Σ_a λ_a P_a and g_λ := Σ_a λ_a g_a. Steady-state probability p_λ: p_λ^T = p_λ^T P_λ.
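A small numerical sketch of the quantities on this slide: what the modeler observes when the m hidden states are aggregated into one and a randomized policy λ mixes the n actions. All names are illustrative.

```python
import numpy as np

def observed_quantities(P, g, lam):
    """P[a]: m x m hidden transition matrix; g[a]: m-vector of rewards;
    lam: randomized policy over the n actions (lam >= 0, sum(lam) == 1).

    Returns the mixtures P_lam and g_lam and the steady-state distribution
    p_lam solving p^T = p^T P_lam (assuming P_lam is irreducible)."""
    P_lam = sum(l * Pa for l, Pa in zip(lam, P))
    g_lam = sum(l * ga for l, ga in zip(lam, g))
    m = P_lam.shape[0]
    # Steady state: solve (P_lam^T - I) p = 0 together with sum(p) = 1.
    A = np.vstack([P_lam.T - np.eye(m), np.ones(m)])
    b = np.concatenate([np.zeros(m), [1.0]])
    p_lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return P_lam, g_lam, p_lam
```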

29 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: a self-enforcing policy λ satisfies p_λ^T g_a = p_λ^T g_{a'} for all a, a' ∈ I_λ (the support of λ); p_λ^T g_a ≥ p_λ^T g_{a'} for all a ∈ I_λ, a' ∈ Ī_λ; and λ^T e = 1. Equivalently, the Equilibrium-Optimality (E-O) condition: (p_λ^T g_a - g*) λ_a = 0 for a = 1, ..., n; p_λ^T g_a - g* ≤ 0 for a = 1, ..., n; and λ^T e = 1.
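Reusing the observed_quantities helper sketched above, the E-O condition can be checked numerically for a candidate policy; this is an illustrative check, not a procedure from the talk.

```python
import numpy as np

# Assumes observed_quantities() from the sketch above is in scope.
def is_self_enforcing(P, g, lam, tol=1e-8):
    """Check the Equilibrium-Optimality condition for a candidate policy lam.

    Under the data generated by lam, every action in the support of lam must
    look exactly optimal (value g_star), and no action may look strictly better.
    """
    _, _, p_lam = observed_quantities(P, g, lam)
    values = np.array([p_lam @ ga for ga in g])   # observed one-period value of each action
    g_star = values.max()
    complementarity = all(abs(values[a] - g_star) <= tol or lam[a] <= tol
                          for a in range(len(g)))
    return complementarity and abs(sum(lam) - 1.0) <= tol
```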

30 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: self-enforcing policies exist. Define the functional F(λ, g) whose first n components are p_λ^T g_a - g (for a = 1, ..., n) and whose last component is λ^T e - 1. Variational inequality (VI) formulation: find (λ*, g*) such that F(λ*, g*)^T ((λ*, g*) - (λ, g)) ≥ 0 for all (λ, g) ∈ [0, 1]^n × [g̲, ḡ]. The E-O condition is equivalent to the VI formulation.

31 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: self-enforcing policies exist. Proposition 3: (1) any (λ*, g*) satisfying the E-O condition also satisfies the VI formulation; (2) if there exists (λ*, g*) satisfying the VI formulation, then the E-O condition also has a solution. Theorem 4: if P is an irreducible matrix, there exists a policy λ* such that the E-O condition holds.

32 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). General case: an m-state space aggregated into an m̄-state space. Matrices A and B connect the original and aggregated state spaces: A is the m × m̄ membership matrix, and B(p) is the m̄ × m weighting matrix whose entry for original state i in aggregate s̄ is p_i / Σ_{j ∈ s̄} p_j (and 0 for states outside s̄). A (randomized) policy is a matrix λ ∈ [0, 1]^{m̄ × n} such that λ e_n = e_m̄. B_λ := B(p), where p is the steady-state probability of policy λ. Aggregated quantities: ḡ_λ = B_λ g_λ and P̄_λ = B_λ P_λ A.
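A short sketch of the aggregation matrices on this slide, built from an assignment of original states to aggregated states and a steady-state vector p; function and argument names are illustrative.

```python
import numpy as np

def aggregation_matrices(member, p):
    """member[i] = index of the aggregated state containing original state i;
    p = steady-state probabilities over the m original states (numpy array).

    A (m x m_bar) is the 0/1 membership matrix; B(p) (m_bar x m) weights each
    original state by its conditional steady-state probability within its
    aggregate, so that g_bar = B g and P_bar = B P A."""
    member = np.asarray(member)
    m, m_bar = len(member), member.max() + 1
    A = np.zeros((m, m_bar))
    A[np.arange(m), member] = 1.0
    mass = A.T @ p                               # total probability of each aggregate
    B = (A * p[:, None]).T / mass[:, None]       # row s_bar: p_i / sum_{j in s_bar} p_j
    return A, B
```

Given these, the aggregated quantities on the slide are g_bar = B @ g_lam and P_bar = B @ P_lam @ A, with g_lam and P_lam the hidden-space mixtures under policy λ.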

33 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Denote Ṽ_λ := (I - α P̄_λ)^{-1} ḡ_λ, which satisfies Ṽ_λ = B_λ g_λ + α B_λ P_λ A Ṽ_λ, and V_λ(s̄, a) := B_λ^{(s̄)} g_{s̄,a} + α B_λ^{(s̄)} P_{s̄,a} A Ṽ_λ, where B_λ^{(s̄)} is the s̄-th row of B_λ. E-O condition: (V_λ(s̄, a) - V̂_{s̄}) λ_{s̄,a} = 0 for all s̄ ∈ S̄, a ∈ U; V_λ(s̄, a) - V̂_{s̄} ≤ 0 for all s̄ ∈ S̄, a ∈ U; and λ e_n = e_m̄.

34 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Variational inequality (VI) formulation: the first m̄·n components of F are F_{s̄,a}(λ, V̂) := V_λ(s̄, a) - V̂(s̄), for s̄ ∈ S̄ and a ∈ U; the last m̄ components of F are F_{s̄}(λ, V̂) := Σ_a λ_{s̄,a} - 1, for s̄ ∈ S̄.

35 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). An example of multiple fixed points: 2 states and 3 actions, with reward vectors g_1, g_2, g_3 and transition matrices P_1, P_2, P_3. This instance admits three distinct self-enforcing policies, each with its own λ, steady-state distribution p_λ, and observed rewards g_λ.

36 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). [Figure: "Three Fixed Points in a 2-State, 3-Action Markov Decision Process"; p_λ^T g for each of the three reward vectors, plotted against the steady-state probability of state 1; the three self-enforcing policies from the previous slide appear as fixed points.]

37 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points. Further issues: stability of fixed points; best self-enforcing policies; batch online learning algorithms.

38 Constructing a DP Model from Data. Random Noise in Data. The transition probabilities and rewards are estimated from data. Estimation errors lead to an upward bias in the optimal profit-to-go estimate: when two actions provide similar profit-to-go, the policy chooses whichever one appears larger due to random noise. We give theoretical justification and computational evidence in the dynamic catalog mailing problem.

39 Constructing a DP Model from Data. Random Noise in Data: Theoretical Justification. Effect of a zero-mean perturbation g̃ of the rewards: Proposition 5: E_g̃[V*(g̃)] ≥ V*. Effect of a zero-mean perturbation of P: two types of effects. Cross-over between different actions produces an upward bias. For a fixed policy, the bias can be upward or downward, since (I - αP)^{-1} g is a nonlinear function of P.
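The mechanism behind Proposition 5 can be illustrated with a tiny Monte Carlo experiment: even zero-mean noise in the estimated rewards biases the estimated optimal value upward, because the optimization picks whichever action happens to look better. This is an illustrative sketch, not an experiment from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([10.0, 10.0])      # two actions with identical true profit-to-go
noise_sd, n_trials = 1.0, 100_000

# Estimated value of the "optimal" action = max over noisy, zero-mean-perturbed estimates.
estimates = true_values + rng.normal(0.0, noise_sd, size=(n_trials, 2))
print("true optimum:", true_values.max())                        # 10.0
print("mean estimated optimum:", estimates.max(axis=1).mean())   # about 10.56: upward bias
```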

40 Constructing a DP Model from Data. Random Noise in Data: Evidence of Bias.
Table 3: Average Profit-to-Go Estimates from a Separate Validation Sample, by Number of States (monthly discount rate 3%)
Number of states | Calibration | Validation | Current policy
500   | $86.69 | $75.8  | --
1,000 | $90.00 | $75.38 | --
2,000 | $92.86 | $75.5  | --

41 Constructing a DP Model from Data. Random Noise in Data: Evidence of Bias (continued). [Figure: profit-to-go estimates ($) from the calibration (prediction) and validation samples, as a function of the number of states.]

42 Field Test. Issues that cannot be resolved by observing historical data alone: the endogeneity problem caused by hidden information; model parameter estimation error; non-stationarity. A large-scale field test of the proposed model is underway: 60,000 customers, randomly assigned to Treatment and Control groups, with decisions made for the Treatment group over 6 months.

43 Field Test. [Schedule: decision dates run from November 2002 through May 2003, each followed by a mailing date from January through June 2003.] Comparisons between the Treatment and Control groups (and against the last-year group): profit; policy; distribution of customers across states; profit-to-go estimates; model parameter estimates.

44 Field Test: Empirical Results, Immediate Reward. [Figure: profit comparison between last year and the field test; average profit ($) by mailing period, showing last-year profit, field-test profit, and both series plus mailing cost.]

45 Field Test: Empirical Results, Policy. [Figure: mailing rates by mailing period, last year versus the field test.]

46 Field Test: Empirical Results, Customer Distribution. [Figure: number of visits to each state (by state index), field test versus the original model.]

47 Field Test: Empirical Results, Fitting the Bellman Equation. [Figure: profit-to-go and immediate reward by mailing period, comparing the average profit-to-go estimate, the one-step discounted continuation-value estimate, and the one-step discounted continuation value plus immediate reward.]

48 Future Research Opportunities. Further investigation of batch online learning; the error structure of profit-to-go estimates from off-policy sample trajectories; robust policies under model inaccuracy.

49 Conclusions. A general solution framework for constructing dynamic decision-making models from data. Dynamic direct mailing problems: large potential profit improvements; model calibrated for a field test. Endogeneity problems: attribution error; batch online learning and fixed-point properties. Effects of model estimation errors. Field test.
