Hazards, Densities, Repeated Events for Predictive Marketing. Bruce Lund



2 A Proposal for Predicting Customer Behavior
A company wants to predict whether its customers will buy a product or obtain a service during specific future time periods. One or more purchases or services by the customer in the same time period constitute a single event.
The goal is to predict repeating (recurring) events: the same event at different time periods. Specifically, create models that give the probability density across time that the customer's j-th event occurs at future time period t.
Two methodologies will be presented. These methods have performed well on simulated data and test data. I think there is potential value for CRM (customer relationship management).

3 Survival Data Mining
Survival Data Mining is a name that SAS has used for applications of time-to-event models where:
- The underlying database is large and consists of transactional data from consumer marketing or credit risk
- The emphasis is on prediction rather than an explanatory study
This talk falls within the general framework of Survival Data Mining.
SAS has an Enterprise Miner application (see the Survival Node).
SAS Training Courses:
- Survival Data Mining Using SAS Enterprise Miner Software
- Survival Data Mining: A Programming Approach
- But not Survival Analysis Using the Proportional Hazards Model
In this talk PROC LOGISTIC is used in fitting the models.
Not mentioned today: PROC PHREG, PROC LIFEREG

4 We Need Notation
$E_{j,t}$ indicates that the j-th event occurred at time t.
Let's explain using $E_{2,3}$ (j = 2, t = 3) and the table below.

Time period   1          2          3          E_{2,3}
Cust 001      event      no event   event      YES
Cust 002      event      event      event      NO
Cust 003      no event   event      event      YES
Cust 004      event      event      no event   NO
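The notation can be checked mechanically. The sketch below is illustrative only: the function name `nth_event_at` and the 0/1 history encoding are assumptions, not part of the talk. It evaluates $E_{2,3}$ for the four customers in the table.

```python
from typing import Sequence


def nth_event_at(history: Sequence[int], j: int, t: int) -> bool:
    """Return True if the j-th event occurs in period t (periods are 1-based).

    history[k] is 1 if the customer had an event in period k+1, else 0.
    """
    if t < 1 or t > len(history) or not history[t - 1]:
        return False  # no event in period t, so no j-th event there
    # The event in period t is the j-th one exactly when j events occurred by t.
    return sum(history[:t]) == j


# Hypothetical encoding of the four customers from the notation table.
customers = {
    "Cust 001": [1, 0, 1],
    "Cust 002": [1, 1, 1],
    "Cust 003": [0, 1, 1],
    "Cust 004": [1, 1, 0],
}
e23 = {cid: nth_event_at(h, j=2, t=3) for cid, h in customers.items()}
```

Cust 002 is a NO because its second event already happened at t = 2; Cust 004 is a NO because nothing happens at t = 3.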

5 Graphic
GOAL: To present 2 different modeling methods for finding the probability densities shown in the graphic.
$f_{j,t}$ is the probability density for $E_{j,t}$ for t = 1 to T, with j fixed.
In words: $f_{j,t}$ is the probability that the customer has the j-th event at time t.
$\sum_{t=1}^{T} f_{j,t} + f_{j,\infty} = 1$, where $f_{j,\infty}$ is the probability that event j occurs later than T or never occurs.
[Graphic: probability (vertical axis) against time 1, 2, ..., T (horizontal axis), showing $f_{j,1}$, $f_{j,2}$, ..., $f_{j,T}$]

6 Customer Data in the Sample Looks Like This

Time period             1                  2                  ...   T
Cust 001 (X1 ... Xk)    Event: Yes or No   Event: Yes or No   ...   Event: Yes or No

Tracking of customers starts at time 1 and ends at time T.
In each period it is observed whether the customer has an event.
Customer events are predicted by covariates X1 ... Xk and a parametrization or transformation of t (time).
Some customers may be censored (drop out). After censoring, a customer is no longer observed (no recording of events).
But in CRM it is often unknown whom to censor, and when, while developing a sample for modeling. So censoring may not be considered when developing the sample.

7 Hazards, Probability Densities, Survival Function
Time is measured in discrete units such as day, month, year.
Let $h_{j,t}$ be the hazard for the time of the j-th event. Mathematically: $h_{j,t} = P(E_{j,t} \mid \text{not } E_{j,t'} \text{ for all } t' < t)$
The probability density of the j-th event is $f_{j,t} = P(E_{j,t})$
Hazards and probabilities are connected by this formula: $f_{j,t} = h_{j,t} \prod_{t'=0}^{t-1} (1 - h_{j,t'})$ with $h_{j,0} = 0$
The survival function $S_{j,t}$ is the probability that the j-th event did not occur by time t: $S_{j,t} = 1 - \sum_{t'=1}^{t} f_{j,t'} = \prod_{t'=0}^{t} (1 - h_{j,t'})$
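As a minimal sketch of these two formulas (the function names are illustrative assumptions, not from the talk), the densities and the survival function can be accumulated from a list of hazards in one pass:

```python
from typing import List


def densities_from_hazards(hazards: List[float]) -> List[float]:
    """hazards[k] is h_{j,k+1}; returns [f_{j,1}, ..., f_{j,T}]."""
    densities, surviving = [], 1.0  # surviving = product of (1 - h) so far
    for h in hazards:
        densities.append(h * surviving)  # f_t = h_t * prod_{t' < t} (1 - h_t')
        surviving *= 1.0 - h
    return densities


def survival_from_hazards(hazards: List[float]) -> List[float]:
    """Returns [S_{j,1}, ..., S_{j,T}] with S_t = prod_{t' <= t} (1 - h_t')."""
    out, surviving = [], 1.0
    for h in hazards:
        surviving *= 1.0 - h
        out.append(surviving)
    return out
```

With hypothetical hazards 0.2 and 0.5, the densities are 0.2 and 0.4, and the survival at t = 2 is 0.4 = 1 - (0.2 + 0.4), so both expressions for $S_{j,t}$ agree.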

8 Bring $h_{j,t}$, $f_{j,t}$ and $S_{j,t}$ together through an example
E.g. tossing a coin with probability of heads = 1/4, where successive tosses are independent. The event of interest is the second heads, j = 2.
$f_{2,t}$ = probability of the 2nd heads at time t = 1, 2, 3
$h_{2,t}$ = hazard for the 2nd heads at time t = 1, 2, 3
These are easy:
$f_{2,1} = 0$
$f_{2,2} = \tfrac14 \cdot \tfrac14 = \tfrac{1}{16}$
$f_{2,3} = \tfrac14 \cdot \tfrac34 \cdot \tfrac14 + \tfrac34 \cdot \tfrac14 \cdot \tfrac14 = \tfrac{3}{32}$

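9 Bring $h_{j,t}$, $f_{j,t}$ and $S_{j,t}$ together through an example
$f_{2,1} = 0$
$f_{2,2} = \tfrac14 \cdot \tfrac14 = \tfrac{1}{16}$
$f_{2,3} = \tfrac14 \cdot \tfrac34 \cdot \tfrac14 + \tfrac34 \cdot \tfrac14 \cdot \tfrac14 = \tfrac{3}{32}$
Now the hazards:
$h_{2,1} = 0$
$h_{2,2} = \tfrac{1}{16}$
It is a little harder to see that $h_{2,3} = f_{2,3} / S_{2,2} = \tfrac{3}{32} / \tfrac{15}{16} = \tfrac{1}{10}$
We can check that the survival function for j = 2, t = 3 satisfies the 2 formulas:
$S_{2,3} = \tfrac{27}{32} = 1 - (0 + \tfrac{1}{16} + \tfrac{3}{32}) = (1 - 0)(1 - \tfrac{1}{16})(1 - \tfrac{1}{10})$
where $S_{j,t} = 1 - \sum_{t'=1}^{t} f_{j,t'} = \prod_{t'=0}^{t} (1 - h_{j,t'})$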
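The coin example can be reproduced exactly with rational arithmetic. This is a small verification sketch (the helper name `density_second_head` is an assumption for illustration): it enumerates toss sequences to get $f_{2,t}$, then derives each hazard as $f_{2,t} / S_{2,t-1}$.

```python
from fractions import Fraction
from itertools import product

p_heads = Fraction(1, 4)


def density_second_head(t: int) -> Fraction:
    """Probability that the 2nd head occurs exactly on toss t."""
    total = Fraction(0)
    for tosses in product([0, 1], repeat=t):
        if tosses[-1] == 1 and sum(tosses) == 2:  # head on toss t, and it is the 2nd
            prob = Fraction(1)
            for x in tosses:
                prob *= p_heads if x else 1 - p_heads
            total += prob
    return total


f = [density_second_head(t) for t in (1, 2, 3)]

# Hazard h_{2,t} = f_{2,t} / S_{2,t-1}, with S_{2,0} = 1.
h, surv = [], Fraction(1)
for ft in f:
    h.append(ft / surv)
    surv -= ft  # after the loop, surv equals S_{2,3}

surv_from_hazards = (1 - h[0]) * (1 - h[1]) * (1 - h[2])
```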

10 Two Methods for Studying the j-th Event will be Presented
Discrete Time Hazard Model (DTHM)
- The discrete time hazard model is a well-known technique for modeling the hazard for the 1st event, $h_{1,t}$, as a function of covariates X and time t
- In this talk the DTHM is simply applied to modeling the hazard for the j-th event. No real change.
Multinomial Logistic Model (MLM)
- Unconventional, but it has worked in simulations and on test data
- Several theoretical issues, but not a problem in practice

11 SAS EM: Survival Node and DTHM
SAS EM includes a SURVIVAL Node which performs DTHM. Click on Applications in the TOOLBAR and bring in the SURVIVAL Node.
» This node has many powerful features.

12 Maximum Number of Events
Logically, the total number of events for a customer must be ≤ T (at most one event per period).
The maximum to be allowed by the modeler will be called J.
If events are rare, then J could be pre-set < T.
If there are a few customers with > J events, then either:
- Remove them, or
- Truncate to J and interpret as "J or more"

13 DTHM for Modeling the Hazard of Time of j-th Event
The DTHM is implemented using binary logistic regression where the input data set to PROC LOGISTIC has a specialized structure: the Customer-Period structure.
» What is the Customer-Period structure of the data? and
» Why do binary logistic regression and the Customer-Period structure give the hazard model?
We will see (along with some hand waving) that this is a consequence of the formulation of the likelihood function for a time-to-event model. Please indulge me: 2 slides to follow.

14 Let L be the Likelihood Function for a Time-to-Event Model
Consider a customer. Suppose nothing happens (regarding the j-th event) until time t. But then at t: either the j-th event occurs OR the customer is censored.
If the customer has the j-th event at t, then $L = f_{j,t}$ (the density is a factor in L).
If the customer is censored, then $L = S_{j,t}$ (the survival function is a factor in L).
Now the contribution to L for this customer is: $L = f_{j,t}^{\,\delta_t} \, S_{j,t}^{\,1 - \delta_t}$
where $\delta_t = 1$ if the j-th event occurred, else $\delta_t = 0$ if censored.
Using the formulas for f and S, this expression for L is re-written in terms of hazards:
$L = \left\{ h_{j,t} \prod_{t'=0}^{t-1} (1 - h_{j,t'}) \right\}^{\delta_t} \left\{ \prod_{t'=0}^{t} (1 - h_{j,t'}) \right\}^{1 - \delta_t}$

15 Likelihood Function L is the Likelihood for Binary Logistic Regression (involves hand waving)
We can transform L (via some algebra) to:
$L = \prod_{t'=0}^{t} h_{j,t'}^{\,Y_{t'}} (1 - h_{j,t'})^{1 - Y_{t'}}$
where $Y_{t'} = 0$ UNLESS $t' = t$ AND the customer has the j-th event at t; then $Y_{t'} = 1$.
Notice that the exponent $Y_{t'}$ has been redefined from $\delta_t$. This is proof by notation!
This is the likelihood function for a binary logistic regression with probability $h_{j,t'}$.
These 2 slides follow P. Allison (1982), Discrete-Time Methods for the Analysis of Event Histories.

16 Now We See How to Structure the Data for Logistic Regression
1. If the j-th event is at t=3, then these factors are in L: $(1 - h_{j,1}) (1 - h_{j,2}) \, h_{j,3}$
   Each factor corresponds to an observation in the data set.
2. If censored at t=2, then these factors are in L: $(1 - h_{j,1}) (1 - h_{j,2})$
3. If no event through t=3, then these factors are in L: $(1 - h_{j,1}) (1 - h_{j,2}) (1 - h_{j,3})$

ID   time t   covariates   target
1    1        a            0
1    2        a            0
1    3        a            1
2    1        b            0
2    2        b            0
3    1        cc           0
3    2        cc           0
3    3        cc           0

How does $h_{j,t}$ depend on time t and the X's? Next slide.
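The three cases above translate directly into rows of a Customer-Period data set. The following is a rough sketch under assumed names (`customer_period_rows`, and the dictionary keys, are hypothetical): one row per period at risk, with target 1 only in the period where the j-th event occurs.

```python
from typing import Dict, List


def customer_period_rows(cust_id: int, covariate: str,
                         last_period: int, event_at_last: bool) -> List[Dict]:
    """One row per observed period; target = 1 only if the event occurs in the last row."""
    rows = []
    for t in range(1, last_period + 1):
        target = 1 if (event_at_last and t == last_period) else 0
        rows.append({"ID": cust_id, "t": t, "X": covariate, "target": target})
    return rows


data = (
    customer_period_rows(1, "a", last_period=3, event_at_last=True)     # event at t=3
    + customer_period_rows(2, "b", last_period=2, event_at_last=False)  # censored at t=2
    + customer_period_rows(3, "cc", last_period=3, event_at_last=False) # no event through t=3
)
targets = [row["target"] for row in data]
```

Each row contributes one factor, $h$ or $(1 - h)$, to the likelihood, which is why an ordinary binary logistic fit on these rows estimates the hazard.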

17 How does $h_{j,t}$ depend on time t and the X's?
We will select the logistic function (among several choices):
$h_{j,t} = \exp(x\beta) / (1 + \exp(x\beta))$ where $x\beta = \alpha(t) + \beta X$
and X are covariates, β are coefficients fit by the model.
Need to specify how α(t) depends on t. Choices for α(t) include:
- $\alpha(t) = \alpha_0 + \alpha_1 t$ (simple linear)
- $\alpha(t) = \alpha_0 + \alpha_j (t = j) + \dots + \alpha_{T-1} (t = T-1)$ (dummies for time t ≥ j)
- others (cubic splines)
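For concreteness, here is a minimal sketch of the logistic hazard with the time-dummy choice of α(t). All names and coefficient values are hypothetical; in practice the coefficients come from the fitted model.

```python
import math
from typing import Dict, Sequence


def logistic_hazard(t: int, x: Sequence[float], alpha0: float,
                    alpha_dummies: Dict[int, float], beta: Sequence[float]) -> float:
    """h = exp(xbeta) / (1 + exp(xbeta)) with xbeta = alpha(t) + beta . x.

    alpha_dummies maps a period to its dummy coefficient; a period that is
    absent (the reference period) contributes 0.
    """
    xbeta = alpha0 + alpha_dummies.get(t, 0.0) + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(xbeta) / (1.0 + math.exp(xbeta))


# Hypothetical coefficients: period 3 is the reference level.
h1 = logistic_hazard(1, [1.0], alpha0=-1.0, alpha_dummies={1: 0.5, 2: 0.2}, beta=[0.3])
h3 = logistic_hazard(3, [1.0], alpha0=-1.0, alpha_dummies={1: 0.5, 2: 0.2}, beta=[0.3])
```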

18 Compute Hazards for the 2nd Event at Time t
PROC LOGISTIC, when applied to the Customer-Period data set, computes $h_{2,t}$:
    Proc Logistic Data=example desc;
      Class t;                          /* = a choice of α(t) */
      Model Y2 = X t;
      Output out=predict pred=hazard;
      Where t > 1;                      /* risk set starts at t=2 for the second event */
    run;
[Table: Customer-Period input data with columns ID, t, X, Event, Y; more IDs follow]

19 Densities $p_{2,t}$ for the 2nd Event at Time t
After Proc Logistic (and a DATA step), ID=2 has hazards h21, h22, h23, ... and densities p21, p22, p23, ...
If t = 1, then $p_{j,t} = h_{j,t}$
If t > 1, then $p_{j,t} = h_{j,t} \prod_{t'=1}^{t-1} (1 - h_{j,t'})$
Why use $p_{j,t}$ instead of $f_{j,t}$? To reserve $f_{j,t}$ for the theoretical ideal.
[Table: Customer-Period rows (ID, t, X, Y), with one period marked "No observation"]

20 Model Scores for $p_{2,4}$
Fit the model. Next, use the formulas to score $p_{2,4}$.
Now all customers have hazards and probabilities for all t.
[Table: by ID, hazards h21, h22, h23, h24 and densities p21, p22, p23, p24]

21 Types of Covariates for $h_{j,t}$

Time period                   1       2       ...   T
Time: T - 1 dummies           tdum1, tdum2, ..., tdumT-1 (one set per period)
OR t (or a transform of t)    1       2       ...   T
X fixed at t = 0              x       x       ...   x
X * time interaction          x*1     x*2     ...   x*T
Z time varying                z1      z2      ...   zT

For prediction, time-varying Z must either be lagged by T or separately forecast.

22 If CLASS t, then $p_{j,t}$ ~ correct on average
N = number of customers in the sample
$N_{j,t}$ = number with $E_{j,t}$
$N_{j,t} / N$ = fraction of the sample with the j-th event at t
If CLASS t and no censoring, then: $\sum p_{j,t} / N \approx N_{j,t} / N$ (sum over customers in the sample)
Essentially, an equality. I do not have a mathematical proof.
"Correct on average" is less true if t or some transform of t replaces CLASS t.

23 If CLASS t, then $p_{j,t}$ ~ correct on average
$N_t$ = number of customers in the sample at time t
$N_{j,t}$ = number with $E_{j,t}$
$N_{j,t} / N_t$ = fraction with the j-th event at t
Also holds with censoring: $\sum p_{j,t} / N_t \approx N_{j,t} / N_t$, where N is replaced by $N_t$ (= number surviving to time t)
For $p_{j,t}$ to give good estimates for individual customers, we need good $X_k$'s and good modeling techniques. How to fit a model is a subject for another talk.
Assuming a good model, $p_{j,t}$ is a candidate to take the role of $f_{j,t}$.

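24 Baseline Hazard Model for 1st Event, T = 6, using DTHM
Hazards without covariates.
[Table: Maximum Likelihood Estimates (DF, Estimate, Wald chi-square, Pr > chi-square) for Intercept and tdum1 to tdum5; Pr < .0001 for the intercept and four of the five time dummies; tdum6 is the reference level with estimate 0]
[Chart: Baseline Hazard for 1st Event, by period, vertical axis 0% to 30%]
$h_{1,1} = \exp(x\beta) / (1 + \exp(x\beta)) = 24.6\%$, with $x\beta$ evaluated at t = 1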

25 Baseline Probability Model for 1st Event, T = 6
The baseline probability density function, $p_{1,t}$ for t = 1 to 6, is computed from the baseline hazards:
If t = 1, then $p_{j,t} = h_{j,t}$
If t > 1, then $p_{j,t} = h_{j,t} \prod_{t'=1}^{t-1} (1 - h_{j,t'})$
It equals the fraction of customers in the sample with the 1st event at t.
[Chart: Baseline Prob Density for First Event, by period, vertical axis 0% to 30%; labeled values include 24.6% and 19.7%]

26 Probabilities for a Customer for Events j = 1 to 6
[Chart: Prob for j-th Event, by period, one curve each for the 1st through 6th event; vertical axis 0% to 12%]

27 Profiles for Levels of X3 among Scored Customers
Compute probabilities for each t for customers with X3=1 (and X3=2, X3=3).
Average these probabilities.
Take the cumulative sum of this average.
Plot the cumulative probability vs. time.
X3=3 is more likely to have events.
[Charts: "Cum Prob of 1st Event - Profile for X3" and "Cum Prob of 3rd Event - Profiles X3", by period, one curve each for X3=1, X3=2, X3=3]

28 We can compute: Expected Number of Events at Time t
$\text{Cum\_p}_{j,t} = \sum_{t'=1}^{t} p_{j,t'}$ = cumulative probability of the j-th event (not j events) by time t
$\text{Expected}(t) = \sum_{j=1}^{\min(J,t)} \text{Cum\_p}_{j,t}$ = expected number of events by time t
[Table: for j = 1 and j = 2, columns t, $p_{1,t}$, Cum_p$_{1,t}$, $p_{2,t}$, Cum_p$_{2,t}$, Expected(t)]
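The formula works because the number of events by time t equals the count of j for which the j-th event has occurred by t, so its expectation is the sum of the cumulative probabilities. A minimal sketch, with a hypothetical function name and made-up densities:

```python
from typing import List


def expected_events(p: List[List[float]], t: int, J: int) -> float:
    """Expected(t) = sum over j = 1..min(J, t) of Cum_p_{j,t}.

    p[j-1][k-1] holds the scored density p_{j,k}.
    """
    return sum(sum(p[j][:t]) for j in range(min(J, t)))


# Hypothetical densities for one customer, J = 2 events, T = 3 periods.
p = [
    [0.20, 0.10, 0.10],  # p_{1,1}, p_{1,2}, p_{1,3}
    [0.00, 0.05, 0.05],  # p_{2,1}, p_{2,2}, p_{2,3}
]
expected_by_3 = expected_events(p, t=3, J=2)  # 0.40 + 0.10
expected_by_1 = expected_events(p, t=1, J=2)  # only j = 1 is possible at t = 1
```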

29 Expected Number of Events at Time t
$\text{Cum\_p}_{j,t} = \sum_{t'=1}^{t} p_{j,t'}$ = probability of the j-th event (not j events) by t
$\text{Expected}(t) = \sum_{j=1}^{\min(J,t)} \text{Cum\_p}_{j,t}$
Customers are ranked by Expected into 5 ranks (separately for t = 3 and t = 6).
BY RANK, compare: Average Expected with Average Actual.
[Table (simulation): by rank and overall, Expected(3), Avg Events(3), Expected(6), Avg Events(6)]

30 A Second Approach to Estimate $f_{j,t}$
It is based on a Multinomial Logistic Model (MLM) with an unordered target.
MLM can be fit with PROC LOGISTIC.
For MLM there is a separate model for each t = 1 to T.
» For the hazard model there was a model for each j = 1 to J
Defining the target for MLM is somewhat confusing. Next slide.

31 Targets and MLM
For t = 3: the target is called ML_3. To define ML_3, see below:
» Let E(t) = 10 if an event occurred at t, else E(t) = 0
FORMULA: ML_3 = E(1) + E(2) + E(3) + 5 * (E(3) = 0)
[Table: by ID, columns t=1, t=2, t=3, ML_3, with the formula worked for each row]

32 Targets and MLM
FORMULA: ML_3 = E(1) + E(2) + E(3) + 5 * (E(3) = 0)
[Table: by ID, columns t=1, t=2, t=3, ML_3, with the formula worked for each row]
ML_3 = 10 occurs when $E_{1,3}$ occurs (no event, no event, event)
ML_3 = 20 occurs when $E_{2,3}$ occurs
ML_3 = 30 occurs when $E_{3,3}$ occurs
The other values of ML_3 (5, 15, 25) are not directly used.
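A compact way to read the target definition: ML_t is 10 times the number of events through period t, plus 5 when there is no event in period t itself. The sketch below assumes that reading of the slide's formula; the function name `ml_target` is hypothetical.

```python
from typing import Sequence


def ml_target(events: Sequence[int], t: int) -> int:
    """Multinomial target for period t (1-based).

    events[k] is 1 if the customer had an event in period k+1, else 0.
    Returns 10 * (number of events through t), plus 5 if no event at t.
    """
    score = 10 * sum(events[:t])
    if not events[t - 1]:
        score += 5
    return score


# All 8 possible three-period histories map onto the 6 levels of ML_3.
levels = {
    (0, 0, 0): ml_target((0, 0, 0), 3),
    (0, 0, 1): ml_target((0, 0, 1), 3),
    (1, 0, 0): ml_target((1, 0, 0), 3),
    (1, 0, 1): ml_target((1, 0, 1), 3),
    (1, 1, 0): ml_target((1, 1, 0), 3),
    (1, 1, 1): ml_target((1, 1, 1), 3),
}
```

Under this coding the levels ending in 0 (10, 20, 30) identify $E_{1,3}$, $E_{2,3}$, $E_{3,3}$, while the levels ending in 5 mean "no event at t = 3".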

33 Targets and MLM
[Table: input data set MLM with columns ID, X, ML_3]
    PROC LOGISTIC DATA = MLM;
      MODEL ML_3 (ref = "5") = X / LINK = GLOGIT;
      SCORE DATA = MLM OUT = SCORED;
    RUN;
SCORED has probabilities for each level (6 in all) of ML_3, including:
P(10) = q13 (10: $E_{1,3}$ occurs)
P(20) = q23 (20: $E_{2,3}$ occurs)
P(30) = q33 (30: $E_{3,3}$ occurs)

34 But we want $q_{j,t}$ as t varies for fixed j
E.g. all T MLM models are needed for j = 2 and $q_{2,t}$ as t = 1 to T.
ML_1 gives: $q_{1,1} = P(E_{1,1})$; $q_{2,1} = P(E_{2,1}) = 0$ (*)
ML_2 gives: $q_{1,2} = P(E_{1,2})$; $q_{2,2} = P(E_{2,2})$
ML_3 gives: $q_{1,3} = P(E_{1,3})$; $q_{2,3} = P(E_{2,3})$; $q_{3,3} = P(E_{3,3})$
ML_4 gives: $q_{1,4} = P(E_{1,4})$; $q_{2,4} = P(E_{2,4})$; $q_{3,4} = P(E_{3,4})$; $q_{4,4} = P(E_{4,4})$
ML_t gives: $q_{1,t} = P(E_{1,t})$; $q_{2,t} = P(E_{2,t})$; $q_{3,t} = P(E_{3,t})$; etc.
(*) By definition: $q_{j,1} = 0$ for j > 1

35 $q_{j,t}$ is correct, on average
N is the number of customers in the sample.
$N_{j,t}$ is the number of customers experiencing $E_{j,t}$.
$q_{j,t}$ is correct on average: $\sum q_{j,t} / N = N_{j,t} / N$ (sum over customers in the sample)
This is a standard property of multinomial logistic regression. It is true for all j and t.
For $q_{j,t}$ to give good estimates for individual customers, find good $X_k$'s.
$q_{j,t}$ can be a good candidate to take the role of $f_{j,t}$.

36 $p_{j,t}$ vs. $q_{j,t}$ as estimators of $f_{j,t}$
Unsettling feature of MLM: it can happen that $\sum_{t=1}^{T} q_{j,t} > 1$. In practice, not a serious problem (I think). But must check. (*)
But always $\sum_{t=1}^{T} p_{j,t} \le 1$; there is a proof. (**)
The hazard model used J models while MLM requires T models.
CLASS t is needed in the hazard model for ~ correct on average. MLM has a model for each t and is always correct on average.
For MLM all predictors interact with time (via separate models).
Both can predict for T future periods.
» Time-varying covariates are lagged or separately forecast
(*) The data structure and strong covariates for each t will restrain (not mathematically prevent) this occurrence.
(**) Thomas, G. (1957). Probability of Sums of Series, American Mathematical Monthly, 64,

37 Comments on MLM
If needed, compute hazards from $q_{j,t}$:
$h_{j,t} = q_{j,t} / S_{j,t-1} = q_{j,t} / \left( 1 - \sum_{i=1}^{t-1} q_{j,i} \right)$
If J is the max number of events:
» There are 2*J levels for the target ML_t (when t ≥ J)
» Use MLM for smaller J?
» Need more applications / simulations to give good guidance
I have had a successful test of MLM for J = 8.
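The hazard conversion can be sketched as a running calculation. The function name is an assumption, and the guard against a non-positive denominator is an added safeguard (not from the talk) for the case where the cumulative $q_{j,t}$ reaches 1.

```python
from typing import List


def hazards_from_q(q: List[float]) -> List[float]:
    """h_t = q_t / (1 - sum_{i < t} q_i) for a fixed event number j."""
    hazards, cumulative = [], 0.0
    for q_t in q:
        surv_prev = 1.0 - cumulative  # S_{j,t-1}
        if surv_prev <= 0.0:
            raise ValueError("cumulative q has reached 1; hazard is undefined")
        hazards.append(q_t / surv_prev)
        cumulative += q_t
    return hazards


example_hazards = hazards_from_q([0.2, 0.4])  # hypothetical q_{j,1}, q_{j,2}
```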

38 Collapsing Levels for MLM?
Consider the target variable ML_3.
[Table: by ID, columns t=1, t=2, t=3, ML_3]
    PROC LOGISTIC DATA = MLM;
      MODEL ML_3 (ref = "5") = X / LINK = GLOGIT;
    RUN;
Could we collapse 15 and 25 into, say, 99 and reduce the complexity of the model?
» Bad for model fit. 15 and 25 have different meanings.

39 Same Target Value with Different Histories?
[Table: by ID, event histories with two codings of the target: Target 1 (a single level 20) and Target 2 (history-specific levels 18 to 22)]
Does predictive accuracy suffer if Target 1 is used for $q_{2,6}$?
With Target 2 instead, $q_{2,6} = P(18) + \dots + P(22)$
If we conclude we must use Target 2 (and similar history coding for other target levels), then MLM becomes far less attractive.
Note: on the training dataset, Average(P(20)) equals Average(P(18) + ... + P(22)).

40 Multinomial and Histories
Two models were fit:
Model 1: Using 20 (and other levels)
Model 2: Replacing 20 with 18 to 22
On Validation, $q_{2,6}$ from Model 1 was ranked. Within each rank the absolute differences between Model 1 and Model 2 for $q_{2,6}$ were computed.
[Table (simulation): for all customers and for each decile, (A) Rank of Model1, (B) Avg Model1, (C) Avg. Abs. Diff (Model1 - Model2), and Pct Diff = C/B]

41 The Choice: DTHM or MLM
In simulations and test data (using the same X's):
» Probabilities are very similar: $p_{j,t} \approx q_{j,t}$
I recommend trying both methods and comparing.

42 How are Models Validated?
The predictive accuracy of $p_{j,t}$ and $q_{j,t}$ is measured against actuals: $e_{j,t} = 1$ if the customer had the j-th event at t, otherwise 0.
This is done on a validation sample.

43 How are Models Validated?
Lift Tables / Charts
- Fit the model on Training; score the Validation dataset
- Via programming, compute $p_{1,1} = P(E_{1,1})$
- Rank customers using $p_{1,1}$ into 5 ranks (rank 1 for highest)
- $e_r$ = event rate for the r-th rank; $p_r$ = avg prob for the r-th rank
[Table: by rank, $p_r$ (mean within rank) and $e_r$]
Profiles of Probabilities for fixed covariate values
[Chart: Profile for X = x, Cum Avg $p_{1,t}$ v. Cum Avg $e_{1,t}$]

44 How are Models Validated?
Lift Tables: Many!!
- If T = 6 and J = 4, then there are 18 lift tables for the combinations of t and j
Profiles of Probabilities for fixed covariate values: Many!!
- 3 covariates with 4 levels each give 12 profiles for each j, so J*12 profiles in total
Need a simple summary metric.

45 Absolute Error
Gives T numbers to measure model performance.
For each customer and t: find the absolute error between expected and actual events.
Suppose T = 3 and J = 2. Here is a customer:
[Table: by t, columns $p_{1,t}$, Cum_p$_{1,t}$, $p_{2,t}$, Cum_p$_{2,t}$, Expected(t), Cum Actual(t), Abs. Error]
Only T = 3 numbers are produced.
[Table: All Customers, Mean Abs. Error for t = 1, 2, T = 3]
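A minimal sketch of this metric follows. The function names and the numbers are hypothetical; the calculation compares Expected(t) with the cumulative count of actual events, per customer and then averaged per period.

```python
from typing import List


def abs_errors(expected: List[float], actual_events: List[int]) -> List[float]:
    """|Expected(t) - Cum Actual(t)| for t = 1..T, for one customer."""
    errors, cum_actual = [], 0
    for expected_t, event in zip(expected, actual_events):
        cum_actual += event
        errors.append(abs(expected_t - cum_actual))
    return errors


def mean_abs_error_by_period(all_expected: List[List[float]],
                             all_actual: List[List[int]]) -> List[float]:
    """Average the per-customer absolute errors for each period: T numbers."""
    per_customer = [abs_errors(x, a) for x, a in zip(all_expected, all_actual)]
    n = len(per_customer)
    return [sum(errs[t] for errs in per_customer) / n
            for t in range(len(per_customer[0]))]


# Hypothetical customer with T = 3: Expected(t) and events by period.
one_customer = abs_errors([0.2, 0.5, 0.9], [0, 1, 0])
mae = mean_abs_error_by_period([[0.2, 0.5, 0.9], [0.4, 0.7, 1.1]],
                               [[0, 1, 0], [0, 0, 1]])
```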

46 Compare Many Models using Mean Absolute Error
Idea #1: Rank models by Mean Absolute Error for T. Will the errors be greatest at the extreme T?
Idea #2: Sum the errors across 1 to T and rank.
[Table: Mean Absolute Error by time for each model (Model #1, ...), with a SUM Error column]

47 Challenges and Sum Up
DTHM: requires an unusual data structure (Customer-Period).
MLM:
- Complicated programming to create the targets for MLM
- The case of 20 vs. the history-specific levels (18 to 22): is this a problem? Several case studies say no.
- Limitation on the size of J: J = 8 was successful in one test
Model fitting challenges: how to efficiently fit models for all J (DTHM) or T (MLM)? Not discussed today.
- One approach is to create predictors and fit one model (J = 2 for DTHM and t = T for MLM)
- Write macros to fit models across all J and all T
What is the best way to compare many candidate models (suppose 20 models)? Rank by absolute error.
Based on simulations and test data: DTHM and MLM produced very similar results.

48 Contact Information
Bruce Lund
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

49 Lift Charts

50 Lift Charts for Probability: 1st Event for t = 1
Fit the model on Training; score the Validation dataset.
Via programming, compute $p_{1,1} = P(E_{1,1})$.
Rank customers using $p_{1,1}$ into 5 ranks (rank 1 for highest).
$e_r$ = event rate for the r-th rank; $p_r$ = avg prob for the r-th rank
[Table: by rank, $p_r$ (mean within rank) and $e_r$]

51 Lift Chart Metrics
$e_r$ = event rate for the r-th rank; $p_r$ = avg prob for the r-th rank
Three metrics are presented:
1. Absolute Error = $|e_1 - p_1| + \dots + |e_5 - p_5|$
2. Separation = $(e_1 - e_5) + (e_2 - e_4)$
3. Monotonic Deviation = $(e_2 - e_1)(e_2 > e_1) + (e_3 - e_2)(e_3 > e_2) + (e_4 - e_3)(e_4 > e_3) + (e_5 - e_4)(e_5 > e_4)$, which equals 0 in this example
[Table: by rank, $p_r$ (mean within rank), $e_r$, and ABS($p_r - e_r$), with totals for Abs error, Separation, and Mono = 0]
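The three metrics can be computed from the five rank-level averages. This sketch assumes exactly 5 ranks ordered from highest to lowest predicted probability, as on the slide; the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def lift_metrics(p: List[float], e: List[float]) -> Tuple[float, float, float]:
    """Return (absolute error, separation, monotonic deviation) for 5 ranks.

    p[r] is the average predicted probability and e[r] the event rate in
    rank r+1, with rank 1 holding the highest predictions.
    """
    if len(p) != 5 or len(e) != 5:
        raise ValueError("this sketch assumes exactly 5 ranks")
    abs_error = sum(abs(e_r - p_r) for p_r, e_r in zip(p, e))
    separation = (e[0] - e[4]) + (e[1] - e[3])
    # Penalize every place where a lower rank has a higher event rate.
    mono_dev = sum(max(e[r + 1] - e[r], 0.0) for r in range(4))
    return abs_error, separation, mono_dev


well_ordered = lift_metrics([0.5, 0.4, 0.3, 0.2, 0.1], [0.5, 0.3, 0.3, 0.2, 0.1])
out_of_order = lift_metrics([0.5, 0.4, 0.3, 0.2, 0.1], [0.4, 0.5, 0.3, 0.2, 0.1])
```

A model with good rank ordering has large Separation and zero Monotonic Deviation; Absolute Error measures calibration within ranks.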

52 Multi v. DTHM: Prob for 1st Event, t = 1
[Charts: "Multi: Prob 1st event at t=1" ($q_r$ vs. $e_r$) and "DTHM: Prob 1st event at t=1" ($p_r$ vs. $e_r$), by quintile rank; each panel reports Abs Error, Separation, and Mono Dev. = 0]
Multinomial is very slightly superior to Hazard: lower Abs. Error, greater Separation.

53 Lift Charts for DTHM: all j and t
[Chart: "DTHM: Prob 1st event at t=1", $p_r$ vs. $e_r$ by quintile rank]
6 lift charts for the 1st event (j=1, t=1 to 6)
5 lift charts for the 2nd event (j=2, t=2 to 6)
Etc.
1 lift chart for the 6th event (j=6, t=6)
In total, 21 lift charts. Many! Is there a summary metric?

54 Other Slides

55 DTHM: One Model instead of J Models?
It is possible to fit a single discrete time hazard model for repeated events, but the target is the occurrence of an event, not of the j-th event.
An observation is created for each time period for a customer. The target is 0 or 1 depending on whether there is an event.
The spells are covariates in these models (the j-th spell identifies the periods before and including the j-th event).
For prediction: try to compute the hazard of the j-th event. Compute $h_{j,t}$ for times t by setting the covariate spell = j in the model equation.
I have been unsuccessful in getting good $h_{j,t}$ by this approach.
See Allison for a discussion of modeling repeated events with discrete time logistic models. The problem of dependence among the event times is discussed.
Allison, P. D. (2010). Survival Analysis Using SAS: A Practical Guide, Chapter 8

56 Alternative (but not useful) Method for DTHM
A different formulation of the hazard model for the 2nd event (j = 2):
Only observe after the time of experiencing $E_{1,t}$ until either $E_{2,t}$ occurs or the end of the observation period is reached.
[Table: by time, event indicator and $h^*_{2,t}$, with n/a for periods outside the risk set and values $h^*_{2,2}$, $h^*_{2,3}$ inside it]
This is the risk set for the 2nd event (= those customers who have experienced the 1st event).
$h^*_{2,t}$ can be fit (first, the X's must be updated to the start time for risk).
But the $h^*_{2,t}$ do not lead to formulas for $p_{2,t}$ (= probability density for $E_{2,t}$).

57 Ideas for Validation Metrics
The following discussion applies to DTHM. It also applies to MLM provided MLM uses the same predictors across t.
Compare mean density to event rates for customer groups across time:
Suppose a predictor X1 has only a few distinct values. For a fixed value of X1 and for fixed j, the average probabilities $p_{j,t}$ ($q_{j,t}$) are computed and compared with the averages of $e_{j,t}$ across t = 1 to T.
The hoped-for outcome is that the average of $p_{j,t}$ ($q_{j,t}$) closely agrees with the average of $e_{j,t}$ across t = 1 to T for each value of X1 and for each j.
The comparison of $p_{j,t}$ ($q_{j,t}$) with the average of $e_{j,t}$ can be applied to covariate patterns (subsets of customers having identical predictor values) for the patterns where there is sufficient sample size to meaningfully compute the average of $e_{j,t}$.
Finally, the idea might be further extended by creating approximate covariate patterns, where a pattern is defined by fixing the values of discrete predictors and fixing the mid-points of ranks of the continuous predictors.

58 Formula that Connects Hazards and Probabilities
Let f(t) be the probability density across values of time t where t = 1, 2, ...
The cumulative distribution for f(t) is $F(t) = \sum_{i=1}^{t} f(i)$ and the survival function is $S(t) = 1 - F(t)$
The hazard function h(t) is given by $h(t) = f(t) / S(t-1)$ with the convention that $S(0) = 1$
Derivation:
(A) $1 - h(t) = 1 - f(t) / S(t-1) = [S(t-1) - f(t)] / S(t-1)$. Note: $S(t-1) - f(t) = S(t)$
(B) Multiply (A) by $S(t-1)$ to give: $S(t-1) (1 - h(t)) = S(t)$
Iterate formula (B) for t-1, t-2, ..., 1 to give: $\prod_{i=1}^{t} (1 - h(i)) = S(t)$
$F(t) = 1 - S(t) = 1 - \prod_{i=1}^{t} (1 - h(i))$
Therefore, $f(t) = F(t) - F(t-1) = h(t) \prod_{i=1}^{t-1} (1 - h(i))$

59 $\sum_{t=1}^{T} p_{j,t} < 1$
G. Thomas (1957) shows for any j and any customer that the sum of the densities satisfies $\sum_{t=1}^{T} p_{j,t} < 1$.
From this point on, j is fixed and does not enter into the logic. The proof recasts the problem in terms of probabilities.
Assume that $h_{j,t}$ gives the probability of heads for the (independent) toss of the t-th unfair coin from a sequence of coins.
The product $\prod_{t'=1}^{t} (1 - h_{j,t'})$ is the probability that no heads occur in the first t tosses.
The product $p_{j,t} = h_{j,t} \prod_{t'=1}^{t-1} (1 - h_{j,t'})$ is the probability that the first head occurs at the t-th toss.
$\sum_{t'=1}^{t} p_{j,t'}$ is the probability of at least one head in the first t tosses.
Therefore, $\sum_{t'=1}^{t} p_{j,t'} + \prod_{t'=1}^{t} (1 - h_{j,t'}) = 1$ for all t.
Since $\prod_{t'=1}^{t} (1 - h_{j,t'}) > 0$ (each logistic hazard satisfies $h_{j,t'} < 1$), it follows that $\sum_{t'=1}^{t} p_{j,t'} < 1$ for any t.
Thomas, G. (1957). Probability of Sums of Series, American Mathematical Monthly, 64,

60 Example where $\sum_{t=1}^{T} q_{j,t} > 1$ (#1)
    Data example;
      input Cust_ID t X1 Y;   /* Y = 1 if event in period t, else 0 */
      datalines;
    ;
    Proc Sort data = example; by Cust_ID t;
    Data ML; set example; by Cust_ID t;
      array Yt{*} Y1 - Y2;
      array X1t{*} X1_1 - X1_2;
      retain Y1 - Y2 X1_1 - X1_2;
      Yt{t} = Y;
      X1t{t} = X1;
      if last.cust_id then do;
        ML_1 = 0; ML_2 = 0;
        do i = 1 to 2;
          ML_2 = ML_2 + 10*Yt{i}*(i < 3) + (i = 2)*(Yt{2} = 0)*(ML_2 > 0)*5;
          ML_1 = ML_1 + 10*Yt{i}*(i < 2);
        end;
        output;
      end;
    run;

61 Example where $\sum_{t=1}^{T} q_{j,t} > 1$ (#2)
    Proc Logistic data = ML;
      model ML_1(ref = "0") = / link = glogit;
      score data = ML out = scored_ml_1(rename=(p_10 = q11));
    Proc Logistic data = ML;
      model ML_2(ref = "0") = X1_2 / link = glogit;
      score data = ML out = scored_ml_2(rename=(p_10 = q12));
    Data q11_q12;
      merge scored_ml_1 scored_ml_2; by Cust_ID;
      sum_q11_q12 = q11 + q12;
    Proc Print data = q11_q12;
      Var Cust_ID q11 q12 sum_q11_q12;
    run;
This example is highly contrived. Cust_IDs 3 and 4 have cumulative density > 1.
[Table: Obs, Cust_ID, q11, q12, sum_q11_q12]

62 Unlikely that $\sum_{t=1}^{T} q_{j,t} > 1$, provided the model is good
Consider j = 1 and Obs = 1.
[Tables: events for Obs = 1 at times 1 and 2, and the corresponding values of ML_1 and ML_2]
Let X be a covariate with X = 5 for Obs = 1. Further assume X = 5 is strongly predictive of an event across all times:
- Then $q_{1,1}$ will be large
- Also $q_{2,2}$ will be large, but then, necessarily, $q_{1,2}$ will be small
This is a loose argument that $\sum_{t=1}^{2} q_{1,t} < 1$, but it conveys the idea that the data restrain $\sum q_{j,t}$, assuming good models.
If $\sum_{t=1}^{T} q_{j,t} > 1$, then it is likely that the MLM models have poor fit for one or more values of t.

63 Sum $p_{j,t}$ Across t vs. Sum $p_{j,t}$ Across j
The summation $(p_{1,1} + p_{1,2} + \dots + p_{1,t_0})$ gives the probability of the customer having the first event by period $t_0$.
In contrast, consider the summation (where $p_{j,t_0} = 0$ if $j > t_0$):
SUM_J($t_0$) = $(p_{1,t_0} + p_{2,t_0} + \dots + p_{J,t_0})$
This gives the probability of a customer having an event in period $t_0$.
Therefore, SUM_J($t_0$) is an in-market model for period $t_0$. But this is a complicated way to obtain this simple model.

64 Independence of Irrelevant Alternatives (IIA)
For multinomial logistic regression the log-odds of alternatives j and k for the i-th customer is given by
$\log(P_{ij} / P_{ik}) = (\beta_j - \beta_k) \, x_i$ (IIA)
In IIA the log-odds of alternatives j and k for the i-th customer involve the coefficients for alternatives j and k as well as the customer predictor values $x_i$, but involve no other alternatives. This restrictive condition may not be appropriate for some models.
Tests of the suitability of multinomial logistic regression, including violations of IIA, are performed by Hausman's Specification Test and the Small and Hsiao Likelihood Ratio Test. SAS implementations of these tests are discussed in the SAS documentation of PROC MDC.
Findings reported by Cheng and Long (2006) show these tests to be unreliable for large-scale applications. A related short discussion of testing the suitability of multinomial logistic regression is given by Paul Allison (2012).
Alternatives to the multinomial logistic model for LTV include more general discrete choice models such as the heteroscedastic extreme value model and the nested-logit model. Neither of these models is subject to the IIA condition. These and other discrete choice models can be fitted by PROC MDC.
Allison, P. (2012). How Relevant Is The Independence Of Irrelevant Alternatives?, Oct 12, 2012, Statistical Horizons. Available At:
Cheng, S. and Long, J. S. (2006). Testing for IIA in the Multinomial Logit Model, Sociological Methods & Research: 35:

65 Sample Size Discussion
General: If counts of observations of event J in the database are low, then modeling J is (probably) not important. Go back to J-1.
Hazard: If modeling the j-th event:
- Rough rule is 10 or 15 observations of j per predictor coefficient. If 10 coefficients, then 100 or 150 observations of j.
- If needed, non-events can be randomly sampled to reduce the size of the dataset.
- At least 30 observations for each time. If T = 10 and j = 3, then require 30 for each of t = 3, ..., 10 (8 * 30 = 240 observations of the j-th event, j = 3).
Multinomial: If modeling 2*j levels:
- If there are 2*j levels of the target and K predictors, then each target level uses K coefficients. For target level g (versus $g_0$, a reference level): $\log(\text{odds})_g = \sum_{k=1}^{K} b_{g,k} \, x_k$
- Rough rule for sample size is (10 or 15 observations) * K * (2*j - 1).
- If needed, high-count levels of the target can be randomly sampled to reduce the size of the dataset.
Extreme Fix: If the count of observations of g is small, one could fit a binary logistic model of $g_0$ (a reference level) versus g for each g beyond $g_0$. For g with small counts, only a few parameters would be used in that model. The 2*j - 1 regressions can be re-combined, as in the paper of Begg and Gray, into a multinomial model.
Begg, C. B. and Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions, Biometrika 71, 1, pp

66 The Start Date?
[Tables: for Cust 01 and Cust 02, calendar time (week10 to week14) shown against time from start and events]
Start a customer one week after the customer had an event? But some customers do not have a prior event.
Start from the customer-date (first association with the customer)? That might be a long time ago, and events are different now.
Start date is an important decision; there are no general rules.
In CRM models: using the same start date is natural and effective.

67 Total Error for 1st Event - a Heuristic
[Table: for IDs A and B, densities p11 to p16 and event indicators e11 to e16]
A: Total Error for IDs with no 1st Event: assumes the 1st Event occurs at t = 7.
   Weight = (7 - t) for t = 1 to 6; Error = Weight(t)*Prob(t)
B: Total Error for IDs with a 1st Event: the 1st Event occurs at $t_0$.
   Weight = abs($t_0$ - t) for t = 1 to 6; Error = Weight(t)*Prob(t)
[Tables: two worked examples, labeled "Not a Bad Error" and "Bad Error", each listing t, Wgt, Prob, Error and the TOTAL ERROR]
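A rough sketch of the heuristic, assuming the total error is the sum of Weight(t)*Prob(t) over the scored periods. The function name and the flat 0.1 densities are hypothetical; customers with no event are treated as if the event occurred one period after the last scored period (t = 7 when T = 6).

```python
from typing import List, Optional


def total_error(probs: List[float], first_t: int,
                event_t: Optional[int] = None) -> float:
    """Sum over t of |target - t| * Prob(t).

    probs[k] is the scored density for period first_t + k.
    event_t is the period of the actual event; None means no event was
    observed, in which case the target is the period after the last one.
    """
    last_t = first_t + len(probs) - 1
    target = event_t if event_t is not None else last_t + 1
    periods = range(first_t, last_t + 1)
    return sum(abs(target - t) * p for t, p in zip(periods, probs))


flat = [0.1] * 6                                   # hypothetical p11..p16
err_no_event = total_error(flat, first_t=1)        # weights 6, 5, 4, 3, 2, 1
err_event_at_3 = total_error(flat, first_t=1, event_t=3)  # weights 2, 1, 0, 1, 2, 3
```

Probability mass placed far from the actual event period is weighted more heavily, so a density concentrated near $t_0$ yields a small total error.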

68 Total Error for 5th Event
[Table: for IDs A and B, densities p55, p56 and event indicators e55, e56]
A: Total Error for IDs with no 5th Event: assumes the 5th Event occurs at t = 7.
   Weight = (7 - t) for t = 5 to 6; Error = Weight(t)*Prob(t)
B: Total Error for IDs with a 5th Event: the 5th Event occurs at $t_0$.
   Weight = abs($t_0$ - t) for t = 5 to 6; Error = Weight(t)*Prob(t)
[Tables: two worked examples, each listing t, Wgt, Prob, Error and the TOTAL ERROR]

69 Total Error: DTHM v. Multinomial
[Table: average total error by j for the Hazard Model and for Multinomial, shown separately for "No j-th Event" and "j-th Event Occurs", with the differences between the two methods]
Multinomial has smaller total error for "no events": j = 1 to 4.
MLM for j = 1 appears to be better than DTHM.
In this example, Multinomial is slightly better than DTHM.

70 References
There are hundreds of good references on the topic of discrete time hazard models (DTHM). I've listed two references that introduce the DTHM in a readable and thoughtful style.
Allison, P. D. (2010). Survival Analysis Using SAS: A Practical Guide, Second Edition, Cary, NC, SAS Institute. See chapter 7
Singer, J. D. and Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence, New York, Oxford University Press. See chapters 9-12
Training Course: For SAS users I recommend Survival Data Mining: A Programming Approach
The second method in this talk for finding the densities for repeated events is based on a multinomial logistic model. I've not seen papers or books that use the multinomial logistic model for this purpose. I've listed Paul Allison's book on Logistic Regression as a readable and thoughtful introduction to logistic regression including the multinomial model.
Allison, P. (2012). Logistic Regression Using SAS, Cary, NC, SAS Institute.
The ideas in this talk are drawn from two papers I've presented at SAS Global Forums:
Lund B. (2016). Probability Density for Repeated Events, Proceedings of the SAS Global Forum 2016 Conference, Paper
Lund B. (2015). Multinomial Logistic Model for Long-Term Value, Proceedings of the SAS Global Forum 2015 Conference, Paper

MSUG conference June 9, 2016
