Bayesian non-parametric model to longitudinally predict churn
1 Bayesian non-parametric model to longitudinally predict churn. Bruno Scarpa, Università di Padova. Conference of European Statistics Stakeholders: Methodologists, Producers and Users of European Statistics. Rome, 25 November 2014.
2 Churn analysis. A typical problem for many companies with quite a large customer base is the evaluation of customer loyalty: which customers are most likely to abandon the company? These customers are often described as churners. The problem is prominent in sectors where customers have ongoing relationships with companies, i.e. service companies: banks, insurance companies, telecommunications providers, etc. Good models are needed for predicting deactivation (churn) by customers, so that appropriate retention actions can be carried out. A model is needed not only to fit the data and predict future churn, but also, possibly, to suggest marketing actions, e.g. customer retention strategies.
10 Churn analysis. Goal: find for each customer a score of propensity to churn; understand which variables affect the customer's decision to churn, and measure this effect. Understanding effects matters more than sheer predictive accuracy. Typically a data mining model is fitted to a random sample of customer-base data. In this work we consider the prediction of churn for the customer base of a telecommunications company.
15 Sources: socio-demographic data; subscription data; usage & network data; call center data (calls, complaints, billing problems).
16 Longitudinal data. In service companies data are often collected at different time instants; for example, monthly telephone traffic is considered an important predictor of churn. These are longitudinal data that are rarely used in this form to predict churn (typically only some sort of index numbers are used). One possibility for handling this type of data is to treat traffic as functional data: we can then use tools for analysing the relationship between a functional predictor and a binary response.
20 Goal of analysis. Determine whether patterns of phone traffic (number, duration and value of monthly calls) are related to churn. Here our outcome, churn, is univariate, while our predictor, phone traffic, is longitudinal. How do we characterize the pattern of traffic? The number, duration and value measures of traffic are examples of functional predictors: random curves that vary over time, space, or some other domain, defined at every point of the domain but measured only at a finite set of points.
23 General problem and data structure. Interest: the relationship between the functional predictor f_i and the response z_i (inference and prediction). The predictor f_i takes value f_i(t) at location t ∈ {1, ..., T}. Data consist of {y_i, z_i}_{i=1}^n, with y_i = (y_{i1}, ..., y_{iT})^T, where: y_{ij} = error-prone measure of f_i(t_{ij}) (telephone traffic at month t_{ij}); t_{ij} = location (or time) of observation j; n_i = number of observations on subject i; z_i = response variable (churn). In addition we may have x_i = static predictor variables (age, sex, ...).
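The data layout above can be sketched as a minimal container; the field names and example values are illustrative assumptions, not the authors' actual data.

```python
import numpy as np

# Hypothetical container mirroring the notation on the slide: y_i are the
# T noisy monthly traffic measurements, x_i the static predictors, and
# z_i the binary churn indicator.
class Customer:
    def __init__(self, y, x, z):
        self.y = np.asarray(y, dtype=float)   # y_i = (y_i1, ..., y_iT)^T
        self.x = np.asarray(x, dtype=float)   # static predictors (age, sex, ...)
        self.z = int(z)                       # churn indicator in {0, 1}

# Example: 9 monthly measurements, two static predictors, a churner.
cust = Customer(y=[12, 10, 11, 9, 7, 5, 4, 2, 1], x=[1.0, 34.0], z=1)
print(cust.y.shape)
```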
27 Latent class trajectory model. Group-based trajectory models are used to identify clusters of subjects following similar trajectories over time. While we may not believe that each subject's phone traffic measures exactly follow one of K curves, this may be a very useful summary of the data.
29 Issues. A good choice of parametric form for the latent trajectory curves is often unclear (prefer a nonparametric form?). iid N(0, σ²) residuals are restrictive and may imply many latent classes. The number of latent classes is unknown, and BIC-type criteria may perform poorly.
32 Our interest. Use a semiparametric Bayes joint modelling framework. For ease of interpretation, group individuals into functional-predictor clusters (patterns of traffic), with the number of clusters not specified in advance. Allow the response (churn) distribution to vary nonparametrically across clusters. Conduct inference on changes in churn.
36 The data. 3000 post-paid SIM cards; number of outgoing calls for 9 consecutive months; socio-demographic characteristics (sex, age, ...) and contract-related characteristics (services, payment method, ...); churn status (active/deactivated) after three months.
37 From data to model. Flexibility to capture irregularities, if present: nonparametric modelling. Estimate variability both between functional curves and between output distributions.
38 Bayesian nonparametric approach. We follow Bigelow and Dunson (2009) with some modifications: we use Gaussian processes as baseline measures (B&D 2009 use spline functions), and in the estimation algorithm we use a nested Metropolis-Hastings step. By-product: functional clustering.
39 Joint model. Joint modelling of {y_i, x_i, z_i}_{i=1}^n: (1) specification of a model for each component, y_i and z_i | x_i; (2) specification of a joint prior for the parameters of the two models.
40 Components of the model. Model for the output (churn), a GLM: z_i ~ Bin(1, π_i), with π_i = e^{ξ_i} / (1 + e^{ξ_i}) and ξ_i = a_i + x_i^T γ. Model for the trajectory: y_i(t) = f_i(t) + ε_{it}, ε_{it} ~ N(0, τ^{-1}), f_i ~ G. Joint model: θ_i = {f_i, a_i} ~ P. The dependence between the functional predictor f_i ∈ Ω and the response z_i ∈ R is characterised through P, a random probability measure on (R^{T+1}, B).
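The two components above can be sketched numerically; the parameter values below (a_i, γ, τ) are arbitrary placeholders for illustration, not fitted quantities.

```python
import numpy as np

# Response component: pi_i = exp(xi_i) / (1 + exp(xi_i)), xi_i = a_i + x_i^T gamma.
def churn_probability(a_i, x_i, gamma):
    xi = a_i + x_i @ gamma
    return 1.0 / (1.0 + np.exp(-xi))

# Trajectory component: y_i(t) = f_i(t) + eps_it, eps_it ~ N(0, 1/tau).
def trajectory_loglik(y_i, f_i, tau):
    resid = y_i - f_i
    T = len(y_i)
    return 0.5 * T * np.log(tau / (2 * np.pi)) - 0.5 * tau * np.sum(resid**2)

x_i = np.array([1.0, 0.0])
gamma = np.array([-1.2, 0.4])
print(round(churn_probability(0.5, x_i, gamma), 3))
```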
44 Gaussian process. To simplify modelling we give each function f_i a Gaussian process prior, f_i(t) ~ GP(µ, C), where µ is the mean function and C the covariance function. Given the discrete sequence of times (9 observations in our data), the GP induces a multivariate normal distribution on the observed points of the process, f(1), ..., f(T).
46 Gaussian process. Samples from a GP can take a very wide variety of shapes, with limited sensitivity to the mean function. We allow an unknown, fixed mean to avoid sensitivity to the scale of the phone traffic (this still allows a very wide variety of trajectory shapes). The covariance function C controls the types of shapes observed. We use the exponential covariance function, as it allows a wide variety of functional shapes (the squared exponential may overly favour smooth functions): C(t, t') = (1/κ₁) exp(−|t − t'| / κ₂), where κ₁ and κ₂ are unknown parameters.
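The exponential covariance and the induced multivariate normal can be sketched as follows; the κ values and mean level are illustrative, not the posterior estimates from the talk.

```python
import numpy as np

# Exponential covariance C(t, t') = (1/kappa1) * exp(-|t - t'| / kappa2),
# evaluated on the 9 monthly time points.
def exp_cov(times, kappa1, kappa2):
    d = np.abs(times[:, None] - times[None, :])
    return np.exp(-d / kappa2) / kappa1

times = np.arange(1, 10, dtype=float)        # 9 months
C = exp_cov(times, kappa1=2.0, kappa2=3.0)

# On a discrete grid the GP induces a multivariate normal distribution:
rng = np.random.default_rng(0)
mu = np.full(9, 5.0)                         # fixed mean level, for illustration
f = rng.multivariate_normal(mu, C)           # one sampled trajectory
print(C.shape)                               # diagonal of C equals 1/kappa1
```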
50 Dirichlet process joint models. θ_i = {f_i, a_i} ~ P. A natural approach is to let P be unknown, with P ~ DP(αP₀), where DP(αP₀) denotes the Dirichlet process (Ferguson, 1973), α is the precision parameter and P₀ the base probability measure.
53 Dirichlet process joint models. Stick-breaking representation (Sethuraman, 1994): P = Σ_{h=1}^∞ π_h δ_{θ*_h}, with θ*_h ~ P₀ iid, where δ_θ is the Dirac probability measure at the atom θ, and π_h = V_h Π_{l=1}^{h−1} (1 − V_l), with V_h ~ Beta(1, α) iid.
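The stick-breaking construction can be sketched with a standard finite truncation (the truncation level H is an implementation convenience, not part of the model):

```python
import numpy as np

# Truncated stick-breaking draw of DP weights:
# V_h ~ Beta(1, alpha), pi_h = V_h * prod_{l<h} (1 - V_l).
def stick_breaking(alpha, H, rng):
    V = rng.beta(1.0, alpha, size=H)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return V * remaining

rng = np.random.default_rng(1)
pi = stick_breaking(alpha=1.0, H=50, rng=rng)
print(pi.sum())  # first H weights sum to just under 1
```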
54 Dirichlet process joint models. Base measure: P₀ = GP(µ, C) × N(0, ν^{−1}); on the observed grid this is the (T+1)-variate normal P₀ = N_{T+1}((µ, 0)^T, blockdiag(C, ν^{−1})).
56 Prior distributions. The Bayesian specification is completed with priors: ν ~ Ga(a_ν, b_ν), precision of the response component; τ ~ Ga(a_τ, b_τ), error precision of the predictor component; κ₁ ~ Ga(a_{κ₁}, b_{κ₁}) and κ₂ ~ Ga(a_{κ₂}, b_{κ₂}), covariance function of the GP; γ_l ~ N(γ₀, η_l^{−1}), static variable effects; η_l ~ Ga(a_η, b_η), l = 1, ..., p, variances for the static variable effects.
57 Dirichlet process joint models. This DP prior induces the Blackwell and MacQueen (1973) Pólya urn scheme: (θ_i | θ_1, ..., θ_{i−1}) ~ (α/(α + i − 1)) P₀ + Σ_{j=1}^{i−1} (1/(α + i − 1)) δ_{θ_j}, where δ_θ is the measure concentrated at θ.
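The urn scheme above can be simulated directly: each new customer joins an existing cluster with probability proportional to its size, or opens a new one with probability proportional to α. This is a generic illustration of the prior's clustering behaviour, not the authors' sampler.

```python
import numpy as np

# Simulate cluster allocations under the Blackwell-MacQueen Polya urn.
def polya_urn_allocations(n, alpha, rng):
    labels = [0]
    counts = [1]                                  # cluster sizes
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)       # last index = new cluster
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels

rng = np.random.default_rng(2)
labels = polya_urn_allocations(3000, alpha=1.0, rng=rng)
print(len(set(labels)))  # number of clusters grows roughly like alpha * log(n)
```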
58 Comments on DP joint models. Subjects are automatically grouped into an unknown number of functional trajectory clusters. Cluster h has functional trajectory f*_h(t) ~ GP(µ_h, C) and response density Bin(1, π(a*_h + x^T γ)). The marginal density of z is a mixture of Bernoullis; within a predictor cluster, the density of z is a single Bernoulli. The DP assumes identical clusters in the predictor and the response.
63 Posterior distribution. Gibbs sampling is straightforward to implement, involving simple steps for sampling from standard distributions, but it is highly computationally intensive. P is almost surely discrete, which yields a clustering of the sample units (customers) without specifying the number of groups in advance. MCMC algorithm (Pólya urn + Gibbs sampler + Metropolis-Hastings); at each iteration: (1) allocate the units to groups; (2) update the group parameters (nested Metropolis-Hastings); (3) update the hyperparameters of the prior distributions.
67 Results interpretation. Label switching is a pain: the number and composition of groups change across iterations of the algorithm, so the output is not directly usable for clustering. This can be addressed by post-processing (Medvedovic and Sivaganesan, 2002), at a considerable extra computational cost: (1) obtain a distance matrix between sample units from the posterior output; (2) apply hierarchical clustering with complete linkage.
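The two post-processing steps can be sketched as follows; the saved allocation matrix here is simulated, standing in for the real MCMC output.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Rows = MCMC iterations, columns = customers; entry = cluster label.
rng = np.random.default_rng(3)
n_iter, n_units = 200, 30
alloc = rng.integers(0, 3, size=(n_iter, n_units))   # placeholder MCMC output

# Step 1: distance = 1 - P(units i and j share a cluster across iterations).
co = np.zeros((n_units, n_units))
for it in range(n_iter):
    co += (alloc[it][:, None] == alloc[it][None, :])
dist = 1.0 - co / n_iter
np.fill_diagonal(dist, 0.0)

# Step 2: hierarchical clustering with complete linkage.
Z = linkage(squareform(dist, checks=False), method="complete")
groups = fcluster(Z, t=3, criterion="maxclust")
print(groups.shape)  # one group label per customer
```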
72 Estimated trajectories and probabilities (figure).
73 Some clusters (figure).
74 The static variables. p1, p2, p3, p4, p5: tariff plan; m1, m2, m3: payment method; e1, e2, e3: age.
75 Lift improvement factor (figure): lift as a function of the fraction of predicted subjects, comparing balanced and unbalanced logistic models, balanced and unbalanced linear models, discriminant analysis, classification trees, MARS, GAM, SVM, random forests, bagging and boosting.
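The lift factor used for the comparison above can be sketched as follows; the scores and churn outcomes below are simulated, not the study's data.

```python
import numpy as np

# Lift at fraction q: churn rate among the top-q scored customers
# divided by the overall churn rate.
def lift(scores, churned, q):
    n_top = max(1, int(np.ceil(q * len(scores))))
    top = np.argsort(scores)[::-1][:n_top]
    return churned[top].mean() / churned.mean()

rng = np.random.default_rng(4)
scores = rng.random(1000)
churned = (rng.random(1000) < 0.05 + 0.6 * scores).astype(int)  # score-correlated churn
print(lift(scores, churned, q=0.1))  # a useful model has lift above 1
```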
More informationOutline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution
Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model
More informationA Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness
A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model
More informationNonparametric Bayes Modeling
Nonparametric Bayes Modeling Lecture 6: Advanced Applications of DPMs David Dunson Department of Statistical Science, Duke University Tuesday February 2, 2010 Motivation Functional data analysis Variable
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationHierarchical Modeling for Univariate Spatial Data
Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationBayesian nonparametrics
Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationAnalysing geoadditive regression data: a mixed model approach
Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression
More informationRonald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California
Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University
More informationA Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models
Journal of Data Science 8(2010), 43-59 A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Jing Wang Louisiana State University Abstract: In this paper, we
More informationBayesian Linear Regression
Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective
More informationImage segmentation combining Markov Random Fields and Dirichlet Processes
Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO
More informationResearch Article Spiked Dirichlet Process Priors for Gaussian Process Models
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 200, Article ID 20489, 4 pages doi:0.55/200/20489 Research Article Spiked Dirichlet Process Priors for Gaussian Process Models
More informationDirichlet Processes: Tutorial and Practical Course
Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS
More informationMotivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University
Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined
More informationHierarchical Modelling for Univariate Spatial Data
Hierarchical Modelling for Univariate Spatial Data Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department
More informationRelated Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM
Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru
More informationBayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang
Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January
More informationNonparametric Bayes regression and classification through mixtures of product kernels
Nonparametric Bayes regression and classification through mixtures of product kernels David B. Dunson & Abhishek Bhattacharya Department of Statistical Science Box 90251, Duke University Durham, NC 27708-0251,
More informationFlexible Regression Modeling using Bayesian Nonparametric Mixtures
Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham
More informationFoundations of Nonparametric Bayesian Methods
1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models
More informationBayesian Nonparametrics: Dirichlet Process
Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian
More informationEfficient Bayesian Multivariate Surface Regression
Efficient Bayesian Multivariate Surface Regression Feng Li (joint with Mattias Villani) Department of Statistics, Stockholm University October, 211 Outline of the talk 1 Flexible regression models 2 The
More informationA Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles
A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles Jeremy Gaskins Department of Bioinformatics & Biostatistics University of Louisville Joint work with Claudio Fuentes
More informationA general mixed model approach for spatio-temporal regression data
A general mixed model approach for spatio-temporal regression data Thomas Kneib, Ludwig Fahrmeir & Stefan Lang Department of Statistics, Ludwig-Maximilians-University Munich 1. Spatio-temporal regression
More informationPart 8: GLMs and Hierarchical LMs and GLMs
Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course
More informationNovember 2002 STA Random Effects Selection in Linear Mixed Models
November 2002 STA216 1 Random Effects Selection in Linear Mixed Models November 2002 STA216 2 Introduction It is common practice in many applications to collect multiple measurements on a subject. Linear
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationDavid B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison
AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University
More informationNormalized kernel-weighted random measures
Normalized kernel-weighted random measures Jim Griffin University of Kent 1 August 27 Outline 1 Introduction 2 Ornstein-Uhlenbeck DP 3 Generalisations Bayesian Density Regression We observe data (x 1,
More informationColouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles
Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Peter Green and John Lau University of Bristol P.J.Green@bristol.ac.uk Isaac Newton Institute, 11 December
More informationNonparametric Bayes tensor factorizations for big data
Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationNonparametric Bayesian modeling for dynamic ordinal regression relationships
Nonparametric Bayesian modeling for dynamic ordinal regression relationships Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz Joint work with Maria
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationOutline. Clustering. Capturing Unobserved Heterogeneity in the Austrian Labor Market Using Finite Mixtures of Markov Chain Models
Capturing Unobserved Heterogeneity in the Austrian Labor Market Using Finite Mixtures of Markov Chain Models Collaboration with Rudolf Winter-Ebmer, Department of Economics, Johannes Kepler University
More informationWeb Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.
Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we
More informationMultilevel Statistical Models: 3 rd edition, 2003 Contents
Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction
More informationNon-parametric Clustering with Dirichlet Processes
Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction
More informationInfinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix
Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations
More informationRiemann Manifold Methods in Bayesian Statistics
Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationBayesian Nonparametric Regression for Diabetes Deaths
Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,
More informationMULTILEVEL IMPUTATION 1
MULTILEVEL IMPUTATION 1 Supplement B: MCMC Sampling Steps and Distributions for Two-Level Imputation This document gives technical details of the full conditional distributions used to draw regression
More informationAn Alternative Infinite Mixture Of Gaussian Process Experts
An Alternative Infinite Mixture Of Gaussian Process Experts Edward Meeds and Simon Osindero Department of Computer Science University of Toronto Toronto, M5S 3G4 {ewm,osindero}@cs.toronto.edu Abstract
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationSpatial Bayesian Nonparametrics for Natural Image Segmentation
Spatial Bayesian Nonparametrics for Natural Image Segmentation Erik Sudderth Brown University Joint work with Michael Jordan University of California Soumya Ghosh Brown University Parsing Visual Scenes
More informationDynamic Generalized Linear Models
Dynamic Generalized Linear Models Jesse Windle Oct. 24, 2012 Contents 1 Introduction 1 2 Binary Data (Static Case) 2 3 Data Augmentation (de-marginalization) by 4 examples 3 3.1 Example 1: CDF method.............................
More informationScaling up Bayesian Inference
Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationWavelet-Based Nonparametric Modeling of Hierarchical Functions in Colon Carcinogenesis
Wavelet-Based Nonparametric Modeling of Hierarchical Functions in Colon Carcinogenesis Jeffrey S. Morris University of Texas, MD Anderson Cancer Center Joint wor with Marina Vannucci, Philip J. Brown,
More informationNonparametric Bayesian Methods - Lecture I
Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics
More informationA Brief Overview of Nonparametric Bayesian Models
A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine
More informationNon-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources
th International Conference on Information Fusion Chicago, Illinois, USA, July -8, Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources Priyadip Ray Department of Electrical
More informationA Nonparametric Bayesian Model for Multivariate Ordinal Data
A Nonparametric Bayesian Model for Multivariate Ordinal Data Athanasios Kottas, University of California at Santa Cruz Peter Müller, The University of Texas M. D. Anderson Cancer Center Fernando A. Quintana,
More informationA comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models
A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models William Barcella 1, Maria De Iorio 1 and Gianluca Baio 1 1 Department of Statistical Science,
More informationPartial factor modeling: predictor-dependent shrinkage for linear regression
modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework
More informationCS Lecture 19. Exponential Families & Expectation Propagation
CS 6347 Lecture 19 Exponential Families & Expectation Propagation Discrete State Spaces We have been focusing on the case of MRFs over discrete state spaces Probability distributions over discrete spaces
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationIndex. Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables.
Index Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables. Adaptive rejection metropolis sampling (ARMS), 98 Adaptive shrinkage, 132 Advanced Photo System (APS), 255 Aggregation
More informationBayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets
Bayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets Athanasios Kottas Department of Applied Mathematics and Statistics,
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationA Process over all Stationary Covariance Kernels
A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that
More informationLocal Likelihood Bayesian Cluster Modeling for small area health data. Andrew Lawson Arnold School of Public Health University of South Carolina
Local Likelihood Bayesian Cluster Modeling for small area health data Andrew Lawson Arnold School of Public Health University of South Carolina Local Likelihood Bayesian Cluster Modelling for Small Area
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationDirichlet Process Mixtures of Generalized Linear Models
Lauren A. Hannah David M. Blei Warren B. Powell Department of Computer Science, Princeton University Department of Operations Research and Financial Engineering, Princeton University Department of Operations
More informationChapter 2. Data Analysis
Chapter 2 Data Analysis 2.1. Density Estimation and Survival Analysis The most straightforward application of BNP priors for statistical inference is in density estimation problems. Consider the generic
More informationChart types and when to use them
APPENDIX A Chart types and when to use them Pie chart Figure illustration of pie chart 2.3 % 4.5 % Browser Usage for April 2012 18.3 % 38.3 % Internet Explorer Firefox Chrome Safari Opera 35.8 % Pie chart
More informationHeriot-Watt University
Heriot-Watt University Heriot-Watt University Research Gateway Prediction of settlement delay in critical illness insurance claims by using the generalized beta of the second kind distribution Dodd, Erengul;
More informationVariational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures
17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter
More informationGaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency
Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency Chris Paciorek March 11, 2005 Department of Biostatistics Harvard School of
More informationA Nonparametric Model for Stationary Time Series
A Nonparametric Model for Stationary Time Series Isadora Antoniano-Villalobos Bocconi University, Milan, Italy. isadora.antoniano@unibocconi.it Stephen G. Walker University of Texas at Austin, USA. s.g.walker@math.utexas.edu
More informationSupplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements
Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model
More informationFrailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.
Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk
More informationBayesian Additive Regression Tree (BART) with application to controlled trail data analysis
Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis Weilan Yang wyang@stat.wisc.edu May. 2015 1 / 20 Background CATE i = E(Y i (Z 1 ) Y i (Z 0 ) X i ) 2 / 20 Background
More informationNPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic
NPFL108 Bayesian inference Introduction Filip Jurčíček Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic Home page: http://ufal.mff.cuni.cz/~jurcicek Version: 21/02/2014
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationMultivariate Normal & Wishart
Multivariate Normal & Wishart Hoff Chapter 7 October 21, 2010 Reading Comprehesion Example Twenty-two children are given a reading comprehsion test before and after receiving a particular instruction method.
More information