Constructing Learning Models from Data: The Dynamic Catalog Mailing Problem


1 Constructing Learning Models from Data: The Dynamic Catalog Mailing Problem. Peng Sun, May 6, 2003.

2 Problem and Motivation. Big industry: in 2000, catalog companies in the USA sent out 17 billion catalogs and generated billions of dollars in spending. Important decision: whom to mail catalogs to (Hayes 1992; Gönül and Shi 1998). Difficult problem: the decision is inherently dynamic. Challenge: constructing a dynamic programming model from off-policy sample trajectories.

3 Problem and Motivation. Contributions: a DP approach for finding dynamic catalog mailing decisions, yielding profit improvements and intuition about the optimal policy; and a treatment of the general problems that arise when constructing dynamic programming / reinforcement learning models from historical data.

4 Contents. The catalog mailing problem: background; model; computational results. Dynamic programming models from data: the endogeneity problem (attribution errors; fixed points in batch online learning); effects of model inaccuracy. Field test.

5 The Catalog Mailing Problem: Background. Prospective customers versus house customers. The RFM model (Bult and Wansbeek 1995): Recency, Frequency, Monetary value; likelihood of response used as the mailing criterion; standard in industry and academic research. Dynamic programming based on RFM: Bitran and Mondschein (1996), Gönül and Shi (1998).

6 The Catalog Mailing Problem: Model Overview. The problem and objective: learn a near-optimal mailing policy directly from data. Available data: transaction history and mailing history.

7 The Catalog Mailing Problem: Model Overview. [Figure: a customer timeline showing purchases and mailings over time.]

8 The Catalog Mailing Problem: Model Overview. The model is a discounted, infinite-horizon DP. Finite state space S encoding each customer's historical information. Decision: mail or do not mail, A = {0, 1}. Mailing policy π : S → A. Profit-to-go (lifetime value) V^π(s) := E[ Σ_{t=0}^∞ α^t g_t | π, s_0 = s ]. Objective: π* = arg max_π V^π. Transition probabilities P^π and rewards g^π are estimated from data.

9 The Catalog Mailing Problem: Model Overview. Data preprocessing: compute n variable values for each customer at each time period. State space construction: build the mapping H and the approximate value function Ṽ with a binary tree. Solving the DP: estimate P and g from the data, then run the policy iteration algorithm.
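To make the "Solving DP" step concrete, here is a minimal sketch (not the implementation used in the talk) of policy iteration for the discounted model above, assuming the preprocessing and state-space construction have already produced empirical estimates P[a] and g[a] for each action.

```python
import numpy as np

def policy_iteration(P, g, alpha, max_iter=100):
    """Policy iteration for a discounted MDP.

    P[a] is an |S| x |S| transition matrix and g[a] an |S|-vector of rewards,
    both estimated from data; alpha is the monthly discount factor.
    Returns a deterministic policy (one action per state) and its profit-to-go V.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)          # start with "do not mail" everywhere
    for _ in range(max_iter):
        # Policy evaluation: solve (I - alpha * P_pi) V = g_pi
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        g_pi = np.array([g[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - alpha * P_pi, g_pi)
        # Policy improvement: greedy one-step lookahead
        Q = np.stack([g[a] + alpha * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```

With A = {0, 1}, P[1] and g[1] would be estimated from the customer-period observations in which a catalog was mailed, and P[0] and g[0] from the remaining observations.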

10 The Catalog Mailing Problem: Variables. Transaction history: recency, frequency, monetary value; purchase stocks; time since becoming a customer. Mailing history: mailing stocks. Seasonality: time of the year.
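The slide does not give exact definitions of the "stock" variables, so the sketch below assumes they are exponentially decayed counts of past purchases and mailings; the function name and decay parameter are illustrative only.

```python
import numpy as np

def customer_variables(purchase_amounts, mailings, decay=0.9):
    """Build per-period state variables for one customer.

    purchase_amounts[t] is the dollar amount purchased in period t (0 if none),
    mailings[t] is 1 if a catalog was mailed in period t.  The "stock" variables
    are assumed here to be exponentially decayed counts; the slide does not
    specify the exact definition.
    """
    T = len(purchase_amounts)
    rows, last_purchase = [], None
    purchase_stock, mailing_stock, frequency, monetary = 0.0, 0.0, 0, 0.0
    for t in range(T):
        recency = (t - last_purchase) if last_purchase is not None else t + 1
        season = t % 12                      # time of the year, assuming monthly periods
        # Record the state observed at the start of period t (history up to t-1).
        rows.append([recency, frequency, monetary, purchase_stock,
                     mailing_stock, t, season])
        if purchase_amounts[t] > 0:
            last_purchase, frequency = t, frequency + 1
            monetary += purchase_amounts[t]
        purchase_stock = decay * purchase_stock + (purchase_amounts[t] > 0)
        mailing_stock = decay * mailing_stock + mailings[t]
    return np.array(rows, dtype=float)
```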

11 The Catalog Mailing Problem: State Space Construction. Linear cuts organized by a binary tree structure. Criteria for segments: observations in a segment should be neighbors in R^n and have similar profitability. This yields the mapping H : R^n → S and the approximate value function Ṽ^π(H(x)) for x ∈ R^n.

12 The Catalog Mailing Problem: State Space Construction. [Figure: the binary tree of cuts partitioning the customer-variable space into segments.]
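A rough sketch of the segment-construction idea, under the assumption that each cut is an axis-aligned threshold on a single variable chosen to separate observations with dissimilar profitability (the slides describe general linear cuts; this simplification and all names below are illustrative):

```python
import numpy as np

def build_tree(X, profit, max_leaves=500, min_size=5000):
    """Greedy binary segmentation of customer-period vectors X (n x d).

    Each leaf becomes one DP state.  Splits are axis-aligned cuts chosen to
    separate observations with dissimilar profitability, a simplification of
    the linear cuts on the slide.  Returns a state index for every row of X.
    """
    leaves = [np.arange(len(X))]              # start with one segment holding everyone
    while len(leaves) < max_leaves:
        best = None
        for li, idx in enumerate(leaves):
            if len(idx) < 2 * min_size:
                continue
            for d in range(X.shape[1]):
                cut = np.median(X[idx, d])
                left = idx[X[idx, d] <= cut]
                right = idx[X[idx, d] > cut]
                if len(left) < min_size or len(right) < min_size:
                    continue
                gap = abs(profit[left].mean() - profit[right].mean())
                if best is None or gap > best[0]:
                    best = (gap, li, left, right)
        if best is None:
            break
        _, li, left, right = best
        leaves[li:li + 1] = [left, right]     # replace the segment by its two children
    state = np.empty(len(X), dtype=int)
    for s, idx in enumerate(leaves):
        state[idx] = s
    return state
```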

13 The Catalog Mailing Problem: Data. A large catalog mailing retailer selling multiple product categories; 1.8 million customers in the clothing section; transaction and mailing history over the past 6 years. We construct our model using 100,000 customers, giving over 5 million observations (an observation is one customer in one time period).

14 The Catalog Mailing Problem. Computational Results: Different Discount Rates. Table 1 reports average profit-to-go estimates and mailing rates by monthly discount rate (continued on the next slide).

15 The Catalog Mailing Problem. Computational Results: Different Discount Rates (continued).
Table 1: Average Profit-to-Go Estimates and Mailing Rates by Discount Rate
Discount rate (1-α) | Normalized avg. V, historical | Normalized avg. V, optimal | Mailing rate, historical | Mailing rate, optimal
15%   | $.64   | --  | 58% | 3%
10%   | $8.45  | --  | 58% | 43%
5%    | $37.39 | --  | --  | 62%
3%    | $59.75 | --  | --  | 7%
0.87% | $59.7  | --  | --  | 78%

16 The Catalog Mailing Problem. Computational Results: Long-Term Profit Flow. [Figure: undiscounted profit flow; profit ($) versus time period; series include the historical policy.]

17 The Catalog Mailing Problem. Computational Results: Changing the Number of States.
Table 2: Average Profit-to-Go Estimates by Discount Rate and Number of States (optimal policy)
Discount rate (1-α) | 500 states | 1,000 states | 2,000 states
15% | $3.52  | $4.02  | $4.52
5%  | $48.23 | $49.88 | $51.42
3%  | $86.69 | $90.00 | $92.86

18 The Catalog Mailing Problem. Computational Results: Mailing Policy. [Figure: current (historical) mailing rate by purchase recency (# of months) and mailing stock.]

19 The Catalog Mailing Problem. Computational Results: Mailing Policy (continued). [Figure: optimal mailing rate by purchase recency (# of months) and mailing stock.]

20 The Catalog Mailing Problem. Computational Results: Profit-to-Go. [Figure: profit-to-go ($) under the current policy by purchase recency (# of months) and mailing stock.]

21 The Catalog Mailing Problem. Computational Results: Profit-to-Go (continued). [Figure: profit-to-go ($) under the optimal policy by purchase recency (# of months) and mailing stock.]

22 Constructing a DP Model from Data. The underlying model is estimated directly from historical data. Endogeneity problem: the computed optimal policy depends on the current (historical) policy, a dependence caused by hidden state information. Attribution error: the historical policy π_H acted on hidden state information that the model does not capture. Batch online learning: iteratively update the policy and re-estimate the model parameters. Self-enforcing policies: the data collected under a policy validates that policy's optimality. Random noise: randomness in the historical data may bias the DP results.

23 Constructing a DP Model from Data. Endogeneity Problem: Attribution Error. Hidden state information affects the historical policy: mailed and not-mailed customers differ along an unobservable dimension. The DP algorithm attributes the effect of the hidden information to the actions themselves: since "mail" ("not mail") looked good, it concludes you should mail (not mail) everyone in the state. The result is an upward-biased profit-to-go estimate. We mitigate these effects in the computation.

24 Constructing a DP Model from Data. Endogeneity Problem: Theoretical Justification. [Figure: an original Markov chain with distinct states i and j and neighboring states a, b, c, d, with steady-state probabilities p and profit-to-go V, shown next to the aggregated chain in which i and j are merged into a single state k, with its own steady-state probabilities and profit-to-go Ṽ.]

25 Constructing a DP Model from Data. Endogeneity Problem: Theoretical Justification (continued). Proposition 1: p^T V = p̃^T Ṽ. Suppose that, under the historical policy, the actions taken at i and j were different. Proposition 2: either Ṽ^[i] ≥ Ṽ or Ṽ^[j] ≥ Ṽ (componentwise); if P^[i] Ṽ ≠ P^[j] Ṽ, then the corresponding inequality is strict in some components.

26 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points. Aggregated state space: the state space summarizes historical information. Batch online learning: iterate between estimating the model parameters and updating the policy used to collect data. Self-enforcing policies: does this batch online learning procedure converge, and if so, to what?

27 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Self-enforcing policies: data collected according to a policy confirms that policy's optimality, so batch online learning stops there. Do such self-enforcing policies exist? If they do, what are their properties?

28 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: m states aggregated into one state. Hidden state space: m states and n actions. Nature knows (in the hidden state space) the transition probability matrices P_a and immediate reward vectors g_a. A (randomized) policy is a vector λ ∈ [0, 1]^n with λ^T e = 1. What we observe: P_λ := Σ_a λ_a P_a and g_λ := Σ_a λ_a g_a. Steady-state probability p_λ: p_λ^T = p_λ^T P_λ.
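A small numerical sketch of the quantities on this slide: what the modeler observes when the m hidden states are aggregated into one and a randomized policy λ mixes the n actions. All names are illustrative.

```python
import numpy as np

def observed_quantities(P, g, lam):
    """P[a]: m x m hidden transition matrix; g[a]: m-vector of rewards;
    lam: randomized policy over the n actions (lam >= 0, sum(lam) == 1).

    Returns the mixtures P_lam and g_lam and the steady-state distribution
    p_lam solving p^T = p^T P_lam (assuming P_lam is irreducible)."""
    P_lam = sum(l * Pa for l, Pa in zip(lam, P))
    g_lam = sum(l * ga for l, ga in zip(lam, g))
    m = P_lam.shape[0]
    # Steady state: solve (P_lam^T - I) p = 0 together with sum(p) = 1.
    A = np.vstack([P_lam.T - np.eye(m), np.ones(m)])
    b = np.concatenate([np.zeros(m), [1.0]])
    p_lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return P_lam, g_lam, p_lam
```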

29 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: a self-enforcing policy λ satisfies p_λ^T g_a = p_λ^T g_{a'} for all a, a' ∈ I_λ (the support of λ); p_λ^T g_a ≥ p_λ^T g_{a'} for all a ∈ I_λ, a' ∈ Ī_λ; and λ^T e = 1. Equivalently, the Equilibrium-Optimality (E-O) condition: (p_λ^T g_a - g*) λ_a = 0 for a = 1, ..., n; p_λ^T g_a - g* ≤ 0 for a = 1, ..., n; and λ^T e = 1.
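Reusing the observed_quantities helper sketched above, the E-O condition can be checked numerically for a candidate policy; this is an illustrative check, not a procedure from the talk.

```python
import numpy as np

# Assumes observed_quantities() from the sketch above is in scope.
def is_self_enforcing(P, g, lam, tol=1e-8):
    """Check the Equilibrium-Optimality condition for a candidate policy lam.

    Under the data generated by lam, every action in the support of lam must
    look exactly optimal (value g_star), and no action may look strictly better.
    """
    _, _, p_lam = observed_quantities(P, g, lam)
    values = np.array([p_lam @ ga for ga in g])   # observed one-period value of each action
    g_star = values.max()
    complementarity = all(abs(values[a] - g_star) <= tol or lam[a] <= tol
                          for a in range(len(g)))
    return complementarity and abs(sum(lam) - 1.0) <= tol
```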

30 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: self-enforcing policies exist. Define the functional F(λ, g) whose first n components are p_λ^T g_a - g (for a = 1, ..., n) and whose last component is λ^T e - 1. Variational inequality (VI) formulation: find (λ*, g*) such that F(λ*, g*)^T ((λ*, g*) - (λ, g)) ≥ 0 for all (λ, g) ∈ [0, 1]^n × [g̲, ḡ]. The E-O condition is equivalent to the VI formulation.

31 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Simple case: self-enforcing policies exist. Proposition 3: (1) any (λ*, g*) satisfying the E-O condition also satisfies the VI formulation; (2) if there exists (λ*, g*) satisfying the VI formulation, then the E-O condition also has a solution. Theorem 4: if P is an irreducible matrix, there exists a policy λ* such that the E-O condition holds.

32 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). General case: an m-state space aggregated into an m̄-state space. Matrices A and B connect the original and aggregated state spaces: A is the m × m̄ membership matrix, and B(p) is the m̄ × m weighting matrix whose entry for original state i in aggregate s̄ is p_i / Σ_{j ∈ s̄} p_j (and 0 for states outside s̄). A (randomized) policy is a matrix λ ∈ [0, 1]^{m̄ × n} such that λ e_n = e_m̄. B_λ := B(p), where p is the steady-state probability of policy λ. Aggregated quantities: ḡ_λ = B_λ g_λ and P̄_λ = B_λ P_λ A.
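A short sketch of the aggregation matrices on this slide, built from an assignment of original states to aggregated states and a steady-state vector p; function and argument names are illustrative.

```python
import numpy as np

def aggregation_matrices(member, p):
    """member[i] = index of the aggregated state containing original state i;
    p = steady-state probabilities over the m original states (numpy array).

    A (m x m_bar) is the 0/1 membership matrix; B(p) (m_bar x m) weights each
    original state by its conditional steady-state probability within its
    aggregate, so that g_bar = B g and P_bar = B P A."""
    member = np.asarray(member)
    m, m_bar = len(member), member.max() + 1
    A = np.zeros((m, m_bar))
    A[np.arange(m), member] = 1.0
    mass = A.T @ p                               # total probability of each aggregate
    B = (A * p[:, None]).T / mass[:, None]       # row s_bar: p_i / sum_{j in s_bar} p_j
    return A, B
```

Given these, the aggregated quantities on the slide are g_bar = B @ g_lam and P_bar = B @ P_lam @ A, with g_lam and P_lam the hidden-space mixtures under policy λ.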

33 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Denote Ṽ_λ := (I - α P̄_λ)^{-1} ḡ_λ, which satisfies Ṽ_λ = B_λ g_λ + α B_λ P_λ A Ṽ_λ, and V_λ(s̄, a) := B_λ^{(s̄)} g_{s̄,a} + α B_λ^{(s̄)} P_{s̄,a} A Ṽ_λ, where B_λ^{(s̄)} is the s̄-th row of B_λ. E-O condition: (V_λ(s̄, a) - V̂_{s̄}) λ_{s̄,a} = 0 for all s̄ ∈ S̄, a ∈ U; V_λ(s̄, a) - V̂_{s̄} ≤ 0 for all s̄ ∈ S̄, a ∈ U; and λ e_n = e_m̄.

34 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). Variational inequality (VI) formulation: the first m̄·n components of F are F_{s̄,a}(λ, V̂) := V_λ(s̄, a) - V̂(s̄), for s̄ ∈ S̄ and a ∈ U; the last m̄ components of F are F_{s̄}(λ, V̂) := Σ_a λ_{s̄,a} - 1, for s̄ ∈ S̄.

35 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). An example of multiple fixed points: 2 states and 3 actions, with reward vectors g_1, g_2, g_3 and transition matrices P_1, P_2, P_3. This instance admits three distinct self-enforcing policies, each with its own λ, steady-state distribution p_λ, and observed rewards g_λ.

36 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points (continued). [Figure: "Three Fixed Points in a 2-State, 3-Action Markov Decision Process"; p_λ^T g for each of the three reward vectors, plotted against the steady-state probability of state 1; the three self-enforcing policies from the previous slide appear as fixed points.]

37 Constructing a DP Model from Data. Endogeneity Problem: Online Learning and Fixed Points. Further issues: stability of fixed points; best self-enforcing policies; batch online learning algorithms.

38 Constructing a DP Model from Data. Random Noise in Data. The transition probabilities and rewards are estimated from data. Estimation errors lead to an upward bias in the optimal profit-to-go estimate: when two actions provide similar profit-to-go, the policy chooses whichever one appears larger due to random noise. We give theoretical justification and computational evidence in the dynamic catalog mailing problem.

39 Constructing a DP Model from Data. Random Noise in Data: Theoretical Justification. Effect of a zero-mean perturbation g̃ of the rewards: Proposition 5: E_g̃[V*(g̃)] ≥ V*. Effect of a zero-mean perturbation of P: two types of effects. Cross-over between different actions produces an upward bias. For a fixed policy, the bias can be upward or downward, since (I - αP)^{-1} g is a nonlinear function of P.
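The mechanism behind Proposition 5 can be illustrated with a tiny Monte Carlo experiment: even zero-mean noise in the estimated rewards biases the estimated optimal value upward, because the optimization picks whichever action happens to look better. This is an illustrative sketch, not an experiment from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([10.0, 10.0])      # two actions with identical true profit-to-go
noise_sd, n_trials = 1.0, 100_000

# Estimated value of the "optimal" action = max over noisy, zero-mean-perturbed estimates.
estimates = true_values + rng.normal(0.0, noise_sd, size=(n_trials, 2))
print("true optimum:", true_values.max())                        # 10.0
print("mean estimated optimum:", estimates.max(axis=1).mean())   # about 10.56: upward bias
```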

40 Constructing a DP Model from Data. Random Noise in Data: Evidence of Bias.
Table 3: Average Profit-to-Go Estimates from a Separate Validation Sample, by Number of States (monthly discount rate 3%)
Number of states | Calibration | Validation | Current policy
500   | $86.69 | $75.8  | --
1,000 | $90.00 | $75.38 | --
2,000 | $92.86 | $75.5  | --

41 Constructing a DP Model from Data. Random Noise in Data: Evidence of Bias (continued). [Figure: profit-to-go estimates ($) from the calibration (prediction) and validation samples, as a function of the number of states.]

42 Field Test. Issues that cannot be resolved by observing historical data alone: the endogeneity problem caused by hidden information; model parameter estimation error; non-stationarity. A large-scale field test of the proposed model is underway: 60,000 customers, randomly assigned to Treatment and Control groups, with decisions made for the Treatment group over 6 months.

43 Field Test. [Schedule: decision dates run from November 2002 through May 2003, each followed by a mailing date from January through June 2003.] Comparisons between the Treatment and Control groups (and against the last-year group): profit; policy; distribution of customers across states; profit-to-go estimates; model parameter estimates.

44 Field Test: Empirical Results, Immediate Reward. [Figure: profit comparison between last year and the field test; average profit ($) by mailing period, showing last-year profit, field-test profit, and both series plus mailing cost.]

45 Field Test: Empirical Results, Policy. [Figure: mailing rates by mailing period, last year versus the field test.]

46 Field Test: Empirical Results, Customer Distribution. [Figure: number of visits to each state (by state index), field test versus the original model.]

47 Field Test: Empirical Results, Fitting the Bellman Equation. [Figure: profit-to-go and immediate reward by mailing period, comparing the average profit-to-go estimate, the one-step discounted continuation-value estimate, and the one-step discounted continuation value plus immediate reward.]

48 Future Research Opportunities. Further investigation of batch online learning; the error structure of profit-to-go estimates from off-policy sample trajectories; robust policies under model inaccuracy.

49 Conclusions. A general solution framework for constructing dynamic decision-making models from data. Dynamic direct mailing problems: large potential profit improvements; model calibrated for a field test. Endogeneity problems: attribution error; batch online learning and fixed-point properties. Effects of model estimation errors. Field test.
