NetBox: A Probabilistic Method for Analyzing Market Basket Data

José Miguel Hernández-Lobato
joint work with Zoubin Ghahramani
Department of Engineering, Cambridge University
October 22, 2012

J. M. Hernández-Lobato (UC), NetBox: A Probabilistic Method for Analyzing Market Basket Data, October 22, 2012

Market Basket Data

- A store sells a large set of products $P = \{p_1, \ldots, p_d\}$.
- A transaction (basket) $t_i \subseteq P$ contains the products bought by a customer during a particular visit to the store.
- The transactions $t_1, \ldots, t_n$ can be encoded as an $n \times d$ binary matrix $X$, with one row per transaction and one column per product.
- $X$ can be very large, e.g. $10^8 \times 10^4$.
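As a minimal sketch of the encoding described above, the following Python snippet builds the binary matrix from a few made-up transactions. The product names and the helper `encode_transactions` are hypothetical, and a dense list of lists is used only for illustration; a matrix of the size mentioned above would need a sparse representation.

```python
def encode_transactions(products, transactions):
    """Return an n x d binary matrix X with X[i][j] = 1 iff product j is in basket i."""
    index = {p: j for j, p in enumerate(products)}
    X = [[0] * len(products) for _ in transactions]
    for i, basket in enumerate(transactions):
        for p in basket:
            X[i][index[p]] = 1
    return X


# Hypothetical toy data (not from the talk).
products = ["bread", "jelly", "milk", "peanut butter"]
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"milk"},
    {"bread", "milk"},
]
X = encode_transactions(products, transactions)
```

Each row of `X` is one basket and each column is one product, which is the layout assumed in the rest of the talk.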

Market Basket Analysis (MBA) and Association Rules

MBA allows us to identify patterns in customer purchases. Ideally, we would like to answer questions such as:
- What products are usually bought together?
- What products may benefit from promotion?
- What are the best cross-selling opportunities?

Association rules are a popular method for MBA [Agrawal et al. 1994].
- The method generates rules of the form $A \Rightarrow B$, where $A, B \subseteq P$ and $A \cap B = \emptyset$.
- $A \Rightarrow B$ means that if $A \subseteq t$ holds, then we should expect $B \subseteq t$ to hold as well, with high probability.
- Example: {peanut butter, jelly} $\Rightarrow$ {bread}.

Problem: the number of possible rules grows exponentially with $d$.
Solution: filter the rules using minimum support and confidence thresholds.
- $\text{support}(A \Rightarrow B) = P(A \cup B \subseteq t)$.
- $\text{confidence}(A \Rightarrow B) = P(B \subseteq t \mid A \subseteq t)$.
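The support and confidence of a rule can be estimated from data by simple counting. The sketch below does this for the peanut butter example on a hypothetical set of four baskets; the function names and the toy data are assumptions made for illustration, not part of the talk.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(B in t | A in t) as count(A and B) / count(A)."""
    a = set(antecedent)
    ab = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        raise ValueError("antecedent never occurs in the data")
    n_ab = sum(1 for t in transactions if ab <= t)
    return n_ab / n_a


# Hypothetical toy baskets.
transactions = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"bread", "milk"},
    {"peanut butter", "jelly", "bread", "milk"},
]
A, B = {"peanut butter", "jelly"}, {"bread"}
rule_support = support(transactions, A | B)      # 2 of 4 baskets contain A and B
rule_confidence = confidence(transactions, A, B)  # 2 of the 3 baskets with A also contain B
```

A rule is kept only if both values exceed the chosen thresholds, which is where the difficulty of choosing those thresholds comes from.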

Some Disadvantages of Association Rules (ARules)

- There is no obvious procedure for selecting the support and confidence thresholds.
  - Too large, and many interesting associations can be missed.
  - Too small, and we obtain an explosion of non-significant rules.
- ARules usually generates a very large number of rules. Identifying the few interesting rules among the many obvious or redundant ones can be difficult.
- Importantly, ARules, as an unsupervised learning method, is usually outperformed by other techniques when making predictions. This means that there are patterns in the data that ARules does not fully capture.

NetBox: A Probabilistic Method for MBA I

NetBox addresses the previous disadvantages of ARules as follows:
- NetBox follows a Bayesian approach. Every hyper-parameter value is either marginalized out or tuned automatically to the data without any human supervision.
- Instead of rules, NetBox generates a network of products [Raeder and Chawla, 2011]. The generated networks often contain several connected components, or clusters, of products. By focusing on these clusters, we avoid examining huge lists with many redundant or uninteresting rules.
- NetBox has better predictive performance than ARules, and it is competitive with or better than alternative state-of-the-art methods at a lower computational cost.

NetBox: A Probabilistic Method for MBA II

- Let $P_x$ be an ideal distribution such that any arbitrary row $x = (x_1, \ldots, x_d)^T$ of the transaction matrix $X$ is sampled from $P_x$.
- We want to specify a model for $P_x$ that can be adjusted to the available data.
- For this, we follow the framework of dependency networks [Heckerman et al. 2001] and attempt to learn the conditional distributions $P(x_1 \mid x_{-1}), \ldots, P(x_d \mid x_{-d})$, where $x_{-i}$ denotes all the entries of $x$ except $x_i$.
- We assume that each conditional $P(x_i \mid x_{-i})$ is a mixture of the predictive distributions of different models.
- In its current form, NetBox mixes the predictions of two models:
  - A sparse binary classifier (NetBox-SBC).
  - A conditional model based on matrix factorizations (NetBox-CMF).

NetBox-SBC

$$P(x_i \mid w, \epsilon, x_{-i}) = \epsilon + (1 - 2\epsilon)\,\Theta\!\left[(2x_i - 1)\left(x_{-i}^T w_{-d} + w_d\right)\right],$$
$$P(w \mid z) = \prod_{i=1}^{d} \left[ z_i\, \mathcal{N}(w_i \mid 0, v) + (1 - z_i)\, \delta(w_i) \right],$$
$$P(z) = \prod_{i=1}^{d} \text{Bern}(z_i \mid p_i),$$
$$P(\epsilon) = \text{Beta}(\epsilon \mid a_0, b_0),$$

where $a_0 = 1$, $b_0 = 9$, $p_1, \ldots, p_{d-1} = 0.5$ and $p_d = 1$.

The posterior distribution is approximated by

$$Q(w, \epsilon, z) = \text{Beta}(\epsilon \mid \tilde{a}, \tilde{b}) \prod_{i=1}^{d} \left[ \mathcal{N}(w_i \mid m_i, \tilde{v}_i)\, \text{Bern}(z_i \mid \tilde{p}_i) \right]$$

using assumed density filtering [Opper, 1998].
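To make the NetBox-SBC likelihood concrete, here is a small sketch that evaluates it for fixed values of $w$ and $\epsilon$. It assumes that $\Theta$ is the Heaviside step function and that the last entry of $w$ plays the role of the bias $w_d$; the value returned at an activation of exactly zero is an arbitrary convention, and the numbers are made up. It does not implement the spike-and-slab prior or assumed density filtering.

```python
def heaviside(a):
    """Step function: 1 for a > 0, else 0 (the value at 0 is an assumed convention)."""
    return 1.0 if a > 0 else 0.0


def sbc_likelihood(x_i, x_rest, w, eps):
    """Evaluate eps + (1 - 2*eps) * Theta[(2*x_i - 1) * (x_rest . w[:-1] + w[-1])].

    x_i    : 0 or 1, the target product
    x_rest : list of d-1 binary inputs (the other products)
    w      : list of d weights; the last entry w[-1] is the bias w_d
    eps    : noise level
    """
    activation = sum(x * wj for x, wj in zip(x_rest, w[:-1])) + w[-1]
    return eps + (1.0 - 2.0 * eps) * heaviside((2 * x_i - 1) * activation)


# Hypothetical weights for a target with three other products plus a bias.
w = [2.0, 0.0, -1.0, -0.5]
p_on = sbc_likelihood(1, [1, 0, 0], w, eps=0.1)   # activation 1.5 > 0, so 1 - eps
p_off = sbc_likelihood(0, [1, 0, 0], w, eps=0.1)  # wrong side of the boundary, so eps
```

The parameter `eps` acts as a floor: even a prediction on the wrong side of the decision boundary keeps probability `eps`, which makes the classifier robust to noisy labels.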

NetBox-CMF

$$P(X \mid U, V) = \prod_{i=1}^{n} \prod_{j=1}^{d} \mathcal{N}(x_{i,j} \mid u_i v_j^T, \sigma^2),$$
$$P(U) = \prod_{i=1}^{n} \prod_{j=1}^{k} \mathcal{N}(u_{i,j} \mid 0, t_j^U),$$
$$P(V) = \prod_{i=1}^{d} \prod_{j=1}^{k} \mathcal{N}(v_{i,j} \mid 0, s_j^V).$$

The posterior distribution is approximated by

$$Q(U, V) = \left[ \prod_{i=1}^{n} \prod_{j=1}^{k} \mathcal{N}(u_{i,j} \mid m_{i,j}^U, \tilde{v}_{i,j}^U) \right] \left[ \prod_{i=1}^{d} \prod_{j=1}^{k} \mathcal{N}(v_{i,j} \mid m_{i,j}^V, \tilde{v}_{i,j}^V) \right]$$

using variational Bayes and the analytic method of Nakajima et al. 2010.

The conditional is modeled assuming

$$P(x_i \mid x_{-i}, w) = \mathcal{N}(x_i \mid x_{-i}^T w, \sigma^2).$$

The posterior of $w$ is approximated with

$$Q(w) = \prod_{i=1}^{d-1} \mathcal{N}(w_i \mid m_i, \tilde{v}_i)$$

by matching the predictive mean and variance of the MF model.
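As an illustration of the Gaussian matrix factorization likelihood only, the sketch below evaluates $\log P(X \mid U, V)$ for given factor matrices. The function name and the toy factors are hypothetical; the variational Bayes fit, the analytic solution of Nakajima et al. and the moment-matching step are not shown.

```python
import math


def mf_log_likelihood(X, U, V, sigma2):
    """Sum over all entries of log N(x_ij | u_i . v_j, sigma2).

    X : n x d data matrix, U : n x k row factors, V : d x k column factors.
    """
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            mean = sum(a * b for a, b in zip(U[i], V[j]))
            total += -0.5 * math.log(2.0 * math.pi * sigma2)
            total -= (x - mean) ** 2 / (2.0 * sigma2)
    return total


# Hypothetical rank-2 factors that reproduce X exactly.
X = [[1, 0], [0, 1]]
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
ll_exact = mf_log_likelihood(X, U, V, sigma2=0.25)

# Zero factors predict 0 everywhere, so the two entries equal to 1 are penalized.
ll_zero = mf_log_likelihood(X, [[0.0, 0.0]] * 2, V, sigma2=0.25)
```

Note that the binary entries of $X$ are treated as real values under this Gaussian model, as in the equations above.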

Model Mixing

We compute the average log marginal likelihood on the available data:

$$l_i^{\text{SBC}} = n^{-1} \sum_{j=1}^{n} \log \left[ x_{j,i}\, P_{\text{SBC}}(x_{j,i} = 1 \mid x_{j,-i}) + (1 - x_{j,i}) \left( 1 - P_{\text{SBC}}(x_{j,i} = 1 \mid x_{j,-i}) \right) \right],$$
$$l_i^{\text{CMF}} = n^{-1} \sum_{j=1}^{n} \log P_{\text{CMF}}(x_{j,i} \mid x_{j,-i}).$$

Let $\pi_i$ be the mixing weight for NetBox-SBC. Then, we estimate $\pi_i$ as

$$\hat{\pi}_i = \exp(l_i^{\text{SBC}}) \left[ \exp(l_i^{\text{SBC}}) + \exp(l_i^{\text{CMF}}) \right]^{-1}.$$

Finally, we generate predictions using

$$P_{\text{NetBox}}(x_i = 1 \mid x_{-i}) = \hat{\pi}_i\, P_{\text{SBC}}(x_i = 1 \mid x_{-i}) + (1 - \hat{\pi}_i)\, x_{-i}^T m.$$
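The mixing step can be sketched in a few lines of Python. The helper names are hypothetical, and the predictive probabilities and means passed in are assumed to come from already fitted NetBox-SBC and NetBox-CMF models. The weight is computed after subtracting the larger log-likelihood, which gives the same value as the formula above while avoiding overflow.

```python
import math


def avg_log_lik_binary(x_col, p_col):
    """Average over transactions of log[x*p + (1 - x)*(1 - p)]."""
    terms = (math.log(x * p + (1 - x) * (1 - p)) for x, p in zip(x_col, p_col))
    return sum(terms) / len(x_col)


def mixing_weight(l_sbc, l_cmf):
    """exp(l_sbc) / (exp(l_sbc) + exp(l_cmf)), computed in a numerically stable way."""
    shift = max(l_sbc, l_cmf)
    e_sbc = math.exp(l_sbc - shift)
    e_cmf = math.exp(l_cmf - shift)
    return e_sbc / (e_sbc + e_cmf)


def netbox_predict(pi_hat, p_sbc, cmf_mean):
    """Mix the SBC probability with the CMF predictive mean."""
    return pi_hat * p_sbc + (1.0 - pi_hat) * cmf_mean


# Hypothetical values for one product i.
l_sbc = avg_log_lik_binary([1, 0, 1, 1], [0.9, 0.1, 0.9, 0.6])
l_cmf = -0.5
pi_hat = mixing_weight(l_sbc, l_cmf)
prediction = netbox_predict(pi_hat, p_sbc=0.9, cmf_mean=0.5)
```

The model with the higher average log-likelihood on product $i$ receives the larger share of the prediction for that product.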

Generating a Network of Products

We assign a weight $w(j, i)$ to the edge connecting products $j$ and $i$ as

$$w(j, i) = P_{\text{NetBox}}(x_i = 1 \mid x_j = 1, x_{-j} = 0) - P_{\text{NetBox}}(x_i = 1 \mid x_{-i} = 0),$$

that is, the change in the predicted probability of buying product $i$ when product $j$ is added to an otherwise empty basket.

We identify the relevant connections using a statistical test:
- We generate $X_{\text{Rand}}$ with the same marginals as $X$ but independent entries.
- NetBox is run on $X_{\text{Rand}}$ to obtain a collection of weights $w_{\text{Rand}}(j, i)$.
- Critical values are obtained by fitting a generalized Pareto distribution (GPD) to $\{w_{\text{Rand}}(k, i) : k = 1, \ldots, d\}$.
- We set the non-significant weights to zero.

Finally, we prune edges to maximize the number of connected components in the network.

[Figure: density plot; vertical axis labeled "Density".]
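The slide does not say how the randomized matrix is constructed. One simple possibility, shown here purely as an assumption, is to shuffle each column independently: this keeps every product's purchase count exactly as in $X$ while destroying any dependence between products. The function name and toy data are hypothetical.

```python
import random


def randomize_columns(X, seed=0):
    """Shuffle each column of X independently, keeping every column sum fixed."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    X_rand = [[0] * d for _ in range(n)]
    for j in range(d):
        column = [X[i][j] for i in range(n)]
        rng.shuffle(column)  # in-place permutation of this product's purchases
        for i in range(n):
            X_rand[i][j] = column[i]
    return X_rand


# Toy matrix in which products 0 and 1 are always bought together.
X = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
X_rand = randomize_columns(X, seed=1)
col_sums = [sum(row[j] for row in X) for j in range(3)]
col_sums_rand = [sum(row[j] for row in X_rand) for j in range(3)]
```

Weights computed on such a matrix reflect only chance co-occurrence, which is why they can serve as a null distribution for the significance test.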

Evaluation of the Prediction Accuracy of NetBox

- The data are split into disjoint sets of training and test transactions.
- 15% of the products in the test transactions are removed.
- We try to identify the products missing from each test transaction.
- Performance measure: recall at 10.

Benchmark methods:
- Association rules (ARules).
- Asymmetric matrix factorization (AMF) [Pan et al., 2009].
- Rank optimized matrix factorization (ROMF) [Rendle et al., 2009].
- Ranking based on frequency (Freq).
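The slide does not spell out how recall at 10 is computed. The sketch below uses a common definition, stated here as an assumption: rank the products not already in the basket by predicted score and report the fraction of removed products that appear among the top $k$. The function name, scores and product labels are hypothetical.

```python
def recall_at_k(scores, observed, held_out, k=10):
    """Fraction of held-out products found among the k highest-scoring candidates.

    scores   : dict mapping product -> predicted score
    observed : set of products still visible in the test transaction
    held_out : non-empty set of products removed from the transaction
    """
    candidates = [p for p in scores if p not in observed]
    ranked = sorted(candidates, key=lambda p: scores[p], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & held_out) / len(held_out)


# Hypothetical scores for one test transaction, evaluated at k = 2 for brevity.
scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7, "e": 0.05}
r = recall_at_k(scores, observed={"a"}, held_out={"b", "e"}, k=2)
```

In this toy case the top two candidates are "b" and "d", so one of the two removed products is recovered.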

Results

Networks of Products

Rules Generated by ARules in the Small Netflix Dataset
ARules generated more than 100,000 rules. We list the top rules according to lift.

More Rules...

And More Rules...

Top Connected Components NetBox Netflix Dataset

Top Frequent Itemsets MaxEnt Netflix Dataset

Top Connected Components NetBox Pubmed Dataset

Top Frequent Itemsets MaxEnt Pubmed Dataset

Top Connected Components NetBox Books Dataset

Top Frequent Itemsets MaxEnt Books Dataset

Conclusions

NetBox is a probabilistic method for market basket analysis which:
- Follows a Bayesian approach and does not require the user to specify any hyper-parameter value.
- Produces a network of products in which related items are connected to each other. These networks are easier to interpret than a list of rules.
- Obtains very good predictive performance.
- Identifies patterns whose support is too low to be identified by frequent itemset methods based on entropy measures.

References

Agrawal, Rakesh and Srikant, Ramakrishnan. Fast algorithms for mining association rules in large databases. In VLDB, pp. 487-499, 1994.

Raeder, Troy and Chawla, Nitesh. Market basket analysis with networks. Social Network Analysis and Mining, 1:97-113, 2011.

Pan, Rong and Scholz, Martin. Mind the gaps: weighting the unknown in large-scale one-class collaborative filtering. In KDD, pp. 667-676, 2009.

Heckerman, David, Chickering, David Maxwell, Meek, Christopher, Rounthwaite, Robert, and Kadie, Carl. Dependency networks for inference, collaborative filtering, and data visualization. The Journal of Machine Learning Research, 1:49-75, 2001.

Opper, Manfred. A Bayesian approach to on-line learning. In On-line Learning in Neural Networks, pp. 363-378. Cambridge University Press, New York, NY, USA, 1998.

Nakajima, Shinichi, Sugiyama, Masashi, and Tomioka, Ryota. Global analytic solution for variational Bayesian matrix factorization. In NIPS, pp. 1768-1776, 2010.

Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pp. 452-461, 2009.

De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23:407-446, 2011.

Thank you for your attention!