MODEL BASED CLUSTERING FOR COUNT DATA


1 MODEL BASED CLUSTERING FOR COUNT DATA
Dimitris Karlis
Department of Statistics
Athens University of Economics and Business, Athens
April

2 OUTLINE
Clustering methods
Model based clustering
- the general model
- algorithmic issues
- inference
- multivariate Normal clustering
Multivariate Poisson distributions
Model based clustering via the multivariate Poisson distribution
- the models
- estimation
Application to marketing data
- the data
- the model
- results
Generalizing the model
Future research and open problems

3 CLUSTERING
Purpose: to find observations with similar characteristics.
Other names: taxonomy, segmentation, etc.
Methods used:
Hierarchical clustering
- large storage demands
- dependence on the distance measure and the linkage method
- no solid theoretical background
K-means
- heuristic algorithm
- computationally feasible
- dependence on the initial solution
- no solid theoretical background
Model based clustering
- probabilistic method
- strong theoretical background
- inferential procedures available

4 MODEL BASED CLUSTERING
The population consists of k subpopulations.
For the q-dimensional observation y_i from the j-th subpopulation:

    y_i ~ f(y_i | θ_j)    (θ_j an unknown vector of parameters)

Unconditional density:

    f(y_i) = Σ_{j=1}^{k} p_j f(y_i | θ_j)

Problem: finding the values of the non-observable vector φ = (φ_1, φ_2, ..., φ_n), where φ_i = j if the i-th observation belongs to the j-th subpopulation.
Define the indicator z_ij = 1 if φ_i = j, and z_ij = 0 otherwise.

5 ESTIMATION
The purpose of model-based clustering is to estimate the parameters Θ = (p_1, ..., p_k, θ_1, ..., θ_k).
Loglikelihood:

    L(y; θ, p) = Σ_{i=1}^{n} ln [ Σ_{j=1}^{k} p_j f(y_i | θ_j) ]

Solution: the EM algorithm.

6 EM ALGORITHM FOR MODEL BASED CLUSTERING
Observed data: y_i
Complete data: (y_i, z_i)
E-step: estimate w_ij = E(z_ij | y_i, Θ)
M-step: update Θ by solving appropriate equations (usually weighted likelihood equations with weights w_ij)
Variants: Classification EM (CEM)
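A minimal sketch (not the slides' code) of this E-step / M-step pattern, written for a k-component univariate Poisson mixture as a simplified stand-in for the multivariate models discussed later; the function name and initialization are assumptions made for illustration.

```python
# Minimal sketch: EM for a k-component univariate Poisson mixture,
# to make the E-step / M-step pattern above concrete.
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(y, k, n_iter=200, seed=0):
    """y: 1-D array of counts; returns (p, lam, w, loglik)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    p = np.full(k, 1.0 / k)                                   # mixing proportions p_j
    lam = rng.uniform(y.min() + 0.1, y.max() + 0.1, size=k)   # component means
    for _ in range(n_iter):
        # E-step: w_ij = E(z_ij | y_i, current parameters)
        dens = poisson.pmf(y[:, None], lam[None, :]) * p[None, :]
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted likelihood equations (closed form for the Poisson)
        p = w.mean(axis=0)
        lam = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
    loglik = np.log((poisson.pmf(y[:, None], lam[None, :]) * p[None, :]).sum(axis=1)).sum()
    return p, lam, w, loglik
```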

7 THE EM ALGORITHM
(Dempster et al., 1977; Meng and van Dyk, 1997; McLachlan and Krishnan, 1997)
Z_i = (X_i, Y_i) complete data, X_i observed data, Y_i missing data (or data that can be treated as missing)
E-step: compute

    Q(φ | φ^(k)) = E( log p(Z | φ) | X, φ^(k) )

M-step: update φ^(k+1) by maximizing Q(φ | φ^(k)) with respect to φ

8 MODEL BASED CLUSTERING VIA MULTIVARIATE NORMAL MIXTURES
(Banfield and Raftery, 1993)
Assume y_i ~ f(y_i | θ_j) is multivariate normal MN(μ_j, Σ_j)
μ_j mean vector
Σ_j covariance matrix
Wide range of applications (see McLachlan and Basford, 1989, McLachlan and Peel, etc.)

9 INTERESTING FEATURES
Σ_k = λ_k D_k A_k D_k'
λ_k volume, A_k shape, D_k orientation

Model  Covariance matrix Σ_k   Shape      Orientation  Volume
1      λ I                     Spherical  None         Same
2      λ_k I                   Spherical  None         Different
3      Σ                       Same       Same         Same
4      λ_k Σ                   Same       Same         Different
5      λ D_k A D_k'            Same       Different    Same
6      λ_k D_k A D_k'          Same       Different    Different
7      λ_k D A_k D'            Different  Same         Different
8      Σ_k                     Different  Different    Different

10 NUMBER OF COMPONENTS
Criteria:
Information criteria
- AIC: AIC = -2 L_k + 2 d_k
- BIC: BIC = -2 L_k + d_k ln(n)
- AIC3, CAIC, JAAIC
- Information Complexity Criteria
- Minimum Information Ratio Criterion
Cross-validated criteria
Classification based criteria
- Normalized Entropy Criterion
Bootstrap based criteria
- Bootstrap LRT
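To make the two main criteria concrete, a small helper (an assumption-free restatement of the formulas above, not code from the slides) that computes AIC and BIC from a maximized loglikelihood L_k, the number of free parameters d_k, and the sample size n:

```python
# AIC / BIC from a fitted model's maximized loglikelihood.
import numpy as np

def information_criteria(loglik, n_params, n_obs):
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * np.log(n_obs)
    return aic, bic
```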

11 INFERENTIAL PROBLEMS
General ML theorem: the number of components that can be fitted to any dataset is finite and depends on the data.
Usually we are able to estimate only a small number of components.
Other topics:
- goodness of fit
- standard errors
- classifying new observations

12 MULTIVARIATE COUNT DATA
Examples:
- number of different crimes in different areas/times
- number of occurrences of different diseases in different areas/times
- number of different transactions for a bank account
- number of purchases of different products
Common elements:
- correlation between the variables
- the data are counts, usually with a large number of zeros, so a multivariate normal approach is inappropriate
Approach proposed: model based clustering based on the multivariate Poisson distribution

13 MULTIVARIATE POISSON DISTRIBUTION
X_i ~ Poisson(λ_i), i = 1, ..., m, with the X_i independent.
Multivariate Poisson distributions (MP) are defined as Y = AX, where A is a q x m matrix of zeros and ones.

14 Example. m = 3, q = 2
X = (X_1, X_2, X_3)'

A = [ 1 0 1 ]
    [ 0 1 1 ]

Y = (Y_1, Y_2)'
Y_1 = X_1 + X_3
Y_2 = X_2 + X_3
(trivariate reduction technique)
Important result: the marginal distributions are Poisson distributions.
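A short simulation sketch of this construction (the rate values are hypothetical, chosen only for illustration): the marginals come out Poisson and the shared term X_3 induces the covariance.

```python
# Bivariate Poisson pair by trivariate reduction: Y1 = X1 + X3, Y2 = X2 + X3.
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, lam3 = 2.0, 1.0, 0.5          # hypothetical rates
x1 = rng.poisson(lam1, size=100_000)
x2 = rng.poisson(lam2, size=100_000)
x3 = rng.poisson(lam3, size=100_000)
y1, y2 = x1 + x3, x2 + x3                  # Y = AX with A = [[1,0,1],[0,1,1]]

# Marginals are Poisson(lam1 + lam3) and Poisson(lam2 + lam3);
# Cov(Y1, Y2) = lam3, so the empirical covariance should be near 0.5.
print(y1.mean(), y2.mean(), np.cov(y1, y2)[0, 1])
```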

15 GENERAL MODEL
Suppose that the matrix A has the form A = [A_1 A_2 ... A_q], where A_i is a matrix with q rows whose columns are all the combinations containing exactly i ones and q - i zeros.

E(Y) = A M and Var(Y) = A Σ A'

where M = E(X) = (λ_1, λ_2, ..., λ_m)' and Σ is the variance/covariance matrix of X, given as

Σ = Var(X) = diag(λ_1, λ_2, ..., λ_m).

16 Example
For instance, for q = 3 we need

A_1 = [ 1 0 0 ]     A_2 = [ 1 1 0 ]     A_3 = [ 1 ]
      [ 0 1 0 ]           [ 1 0 1 ]           [ 1 ]
      [ 0 0 1 ]           [ 0 1 1 ]           [ 1 ]

Y_1 = X_1 + X_12 + X_13 + X_123
Y_2 = X_2 + X_12 + X_23 + X_123
Y_3 = X_3 + X_13 + X_23 + X_123

where the form of the matrix A is now

A = [ 1 0 0 1 1 0 1 ]
    [ 0 1 0 1 0 1 1 ]
    [ 0 0 1 0 1 1 1 ]

and X = (X_1, X_2, X_3, X_12, X_13, X_23, X_123)'.
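A small sketch (names are my own, not from the slides) showing how the full A of the general model can be built programmatically for any q, by stacking the blocks A_1, ..., A_q:

```python
# Build the q x (2^q - 1) matrix A = [A_1 ... A_q] whose columns are all
# 0/1 vectors with at least one 1 (exactly i ones in block A_i).
import numpy as np
from itertools import combinations

def build_A(q):
    cols = []
    for i in range(1, q + 1):                      # blocks A_1, ..., A_q
        for idx in combinations(range(q), i):      # columns with exactly i ones
            col = np.zeros(q, dtype=int)
            col[list(idx)] = 1
            cols.append(col)
    return np.column_stack(cols)

print(build_A(3))   # reproduces the 3 x 7 matrix shown above
```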

17 SOME INTERESTING THINGS
- Calculation of the probability function is computationally demanding.
- Estimation is via an EM algorithm.
- Use of the entire covariance structure is of little practical interest.
- Very few applications exist, usually assuming just a common covariance term.
- Even if we start with conditionally independent Poisson distributions, the resulting mixture has a covariance structure.

18 COVARIANCE STRUCTURE
X_i | λ_i ~ Poisson(λ_i), i = 1, ..., m, with the X_i independent.
Define Y = AX, where A is a q x m matrix of zeros and ones.
Let θ = (λ_1, λ_2, ..., λ_m)' be the vector of parameters. Then

    Y | θ ~ MultPoisson(θ),    θ ~ G(θ).

The unconditional variance of Y is given as Var(Y) = A D A', where

    D = [ Var(λ_1) + E(λ_1)   Cov(λ_1, λ_2)        ...  Cov(λ_1, λ_m)      ]
        [ Cov(λ_1, λ_2)       Var(λ_2) + E(λ_2)    ...  Cov(λ_2, λ_m)      ]
        [ ...                 ...                  ...  ...                ]
        [ Cov(λ_1, λ_m)       Cov(λ_2, λ_m)        ...  Var(λ_m) + E(λ_m)  ]

i.e. D = Var(θ) + B, with B = diag( E(λ_1), E(λ_2), ..., E(λ_m) ).

19 Important things
- The unconditional covariance can be decomposed into two parts: the intrinsic covariance and the covariance due to mixing.
- Even if the variables are conditionally uncorrelated, unconditionally there is covariance.
- The resulting covariance can also be negative (the multivariate Poisson model necessarily has positive covariance).

20 APPLICATION
(Brijs, Karlis, et al.)
The data: 55 customers of a large supermarket chain in Belgium. Purchases of 4 products were of interest:
- Cakemix
- Frosting
- Softener
- Detergent

Mean, variance and variance/mean ratio were computed for Cakemix, Frosting, Detergent and Softener.

Univariate clustering
Product      Number of components
Softener
Detergent    3
Frosting
Cakemix      3

21 Frequency histograms of the four products: cakemix, frosting, detergent, softener.

22 THE MODEL
- Not all the covariances were used.
- Restricted covariance structure.
Important covariances:
- Frosting and cakemix (r = .66)
- Detergent and softener (r = .48)
The model: latent variables X = (X_C, X_F, X_D, X_S, X_CF, X_DS).
The form of the matrix A is now

A = [ 1 0 0 0 1 0 ]
    [ 0 1 0 0 1 0 ]
    [ 0 0 1 0 0 1 ]
    [ 0 0 0 1 0 1 ]

and the vector of parameters is now θ = (λ_C, λ_F, λ_D, λ_S, λ_CF, λ_DS). Thus we have
Y_C = X_C + X_CF
Y_F = X_F + X_CF
Y_D = X_D + X_DS
Y_S = X_S + X_DS

23 Conditional probability function

P(y | θ) = P(y_C, y_F, y_D, y_S | θ)
         = BP(y_C, y_F; λ_C, λ_F, λ_CF) · BP(y_D, y_S; λ_D, λ_S, λ_DS)

where

BP(y_1, y_2; λ_1, λ_2, λ_3) = e^{-(λ_1 + λ_2 + λ_3)} Σ_{i=0}^{min(y_1, y_2)} λ_1^{y_1 - i} λ_2^{y_2 - i} λ_3^i / [ (y_1 - i)! (y_2 - i)! i! ]

with y_1, y_2 = 0, 1, 2, ....

The unconditional probability mass function under a mixture model with k components is

P(y) = Σ_{j=1}^{k} p_j P(y_C, y_F, y_D, y_S | θ_j)
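A direct sketch of the bivariate Poisson pmf used as the building block above (my own code, with hypothetical rates in the usage line), following the sum written out in the formula:

```python
# BP(y1, y2; lam1, lam2, lam3): bivariate Poisson probability mass function.
import math

def bivariate_poisson_pmf(y1, y2, lam1, lam2, lam3):
    total = 0.0
    for i in range(min(y1, y2) + 1):
        total += (lam1 ** (y1 - i) * lam2 ** (y2 - i) * lam3 ** i
                  / (math.factorial(y1 - i) * math.factorial(y2 - i) * math.factorial(i)))
    return math.exp(-(lam1 + lam2 + lam3)) * total

# Example with hypothetical rates; the marginal of Y1 is Poisson(lam1 + lam3).
print(bivariate_poisson_pmf(2, 1, 0.8, 0.5, 0.3))
```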

24 Estimation for the mixture model
- For a model with k components the number of parameters is 7k - 1.
- The likelihood function is quite complicated for direct maximization.
- An EM type of algorithm is used.

25 THE EM ALGORITHM
Observed data: Y_i = (Y_Ci, Y_Fi, Y_Di, Y_Si)
Unobserved data: X_i = (X_Ci, X_Fi, X_Di, X_Si, X_CFi, X_DSi) and Z_i = (Z_1i, Z_2i, ..., Z_ki)
Complete data: the vectors (Y_i, X_i, Z_i)
Θ: the vector of parameters

26 E-step: Using the current values of the parameters calculate

w_ij = E(Z_ji | Y_i, Θ) = p_j P(y_i | θ_j) / P(y_i),   i = 1, ..., n,  j = 1, ..., k

x_CFi = E(X_CF,i | Y_i, Θ)
      = [ Σ_{j=1}^{k} p_j BP(y_Di, y_Si; λ_Dj, λ_Sj, λ_DSj)
            Σ_{r=0}^{min(y_C, y_F)} r Po(y_Ci - r | λ_Cj) Po(y_Fi - r | λ_Fj) Po(r | λ_CFj) ] / P(y_i)

x_DSi = E(X_DS,i | Y_i, Θ)
      = [ Σ_{j=1}^{k} p_j BP(y_Ci, y_Fi; λ_Cj, λ_Fj, λ_CFj)
            Σ_{r=0}^{min(y_D, y_S)} r Po(y_Di - r | λ_Dj) Po(y_Si - r | λ_Sj) Po(r | λ_DSj) ] / P(y_i)

x_Ci = y_Ci - x_CFi,   x_Fi = y_Fi - x_CFi
x_Di = y_Di - x_DSi,   x_Si = y_Si - x_DSi

M-step: Update the parameters

λ_lj = Σ_{i=1}^{n} w_ij x_li / Σ_{i=1}^{n} w_ij,   l ∈ {C, F, D, S, CF, DS},  j = 1, ..., k

p_j = Σ_{i=1}^{n} w_ij / n,   j = 1, ..., k

If some convergence criterion is satisfied, stop iterating; otherwise go back to the E-step.
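A sketch of the membership-weight part of this E-step for the 4-product model (my own code, not the authors'): it reuses bivariate_poisson_pmf() from the earlier snippet, and assumes each component's rates are stored in a dict with keys lamC, lamF, lamD, lamS, lamCF, lamDS.

```python
# E-step membership weights w_ij for the restricted 4-product model.
import numpy as np

def component_density(y, prm):
    yC, yF, yD, yS = y
    return (bivariate_poisson_pmf(yC, yF, prm["lamC"], prm["lamF"], prm["lamCF"])
            * bivariate_poisson_pmf(yD, yS, prm["lamD"], prm["lamS"], prm["lamDS"]))

def e_step_weights(Y, p, params):
    # Y: (n, 4) integer array; p: length-k mixing proportions; params: list of k dicts
    dens = np.array([[pj * component_density(y, prm) for pj, prm in zip(p, params)]
                     for y in Y])
    return dens / dens.sum(axis=1, keepdims=True)   # rows are w_i1, ..., w_ik
```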

27 Issues related to the EM
- Stopping criterion: (L^(k+1) - L^(k)) / L^(k) < tolerance
- Multiple maxima (see the sketch below):
  Step 1: sets of initial values are chosen randomly
  Step 2: each set is run for a small number of iterations
  Step 3: we keep iterating from the solution with the largest likelihood after the initial iterations
  The above procedure was run several times.
- Check via the gradient function
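A sketch of this short-run / long-run multi-start strategy, reusing the em_poisson_mixture() sketch from earlier as a stand-in fitter; the iteration counts are arbitrary illustrative choices. Because EM is deterministic given the seed, re-running the best seed with more iterations continues its trajectory, which implements Step 3.

```python
# Multi-start EM: many short runs, then a long run from the best short-run solution.
def multistart_em(y, k, n_starts=10, short_iters=20, long_iters=500):
    # Steps 1-2: random starting points, each run for a few iterations
    short_runs = [(seed, em_poisson_mixture(y, k, n_iter=short_iters, seed=seed)[-1])
                  for seed in range(n_starts)]
    best_seed = max(short_runs, key=lambda t: t[1])[0]
    # Step 3: keep iterating from the solution with the largest likelihood
    return em_poisson_mixture(y, k, n_iter=long_iters, seed=best_seed)
```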

28 Scalability
- Working with frequency tables instead of the original data speeds up the process considerably.
- Appropriate for huge databases with millions of customers.
- Appropriate for data mining.
- Speed depends on:
  - the number of components
  - the separation between the components
  - the relative size of the parameters

29 Model comparison table: number of components, number of parameters, loglikelihood, AIC and BIC for each fitted model.
Other criteria used: NEC, MIRC; similar results.

30 Estimated parameters per component (λ_1, ..., λ_4, λ_CF, λ_DS and mixing proportion p) for the K = 5 solution.

31 Bivariate Poisson probability function with parameter vector ( , 4, ).

32 Probability function of a trivariate Poisson distribution with full covariance structure

P(y_1, y_2, y_3) = exp( - Σ_{j ∈ A} λ_j )
    Σ_{x_12=0}^{min(y_1, y_2)} Σ_{x_13=0}^{min(y_1 - x_12, y_3)} Σ_{x_23=0}^{min(y_2 - x_12, y_3 - x_13)}
        λ_1^{y_1 - x_12 - x_13} λ_2^{y_2 - x_12 - x_23} λ_3^{y_3 - x_13 - x_23} λ_12^{x_12} λ_13^{x_13} λ_23^{x_23}
        / [ (y_1 - x_12 - x_13)! (y_2 - x_12 - x_23)! (y_3 - x_13 - x_23)! x_12! x_13! x_23! ]

A = {1, 2, 3, 12, 13, 23}

33 The loglikelihood and the AIC and BIC criteria for our dataset, plotted against the number of components k.

34 The mixing proportions for a wide range of models with varying numbers of clusters.

35 Cluster sizes (customers in each cluster) and cluster centers for cakemix, frosting, detergent and softener.

36 Loglikelihood against the number of components for four competing models:
1 - the model described
2 - model with covariance between cakemix - detergent and softener - frosting
3 - common covariance
4 - no covariance model

37 Plot of frosting against cakemix.

38 Plots of the estimated lambda parameters.

39 QUALITY OF THE FITTED MODEL
Entropy criterion:

    I(k) = 1 - [ Σ_{i=1}^{n} Σ_{j=1}^{k} w_ij ln(w_ij) ] / [ n ln(1/k) ]

with the convention that w_ij ln(w_ij) = 0 if w_ij = 0.
Perfect classification gives values near 1. Our solution: 0.83.
The criterion can be used to select the number of components.
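A small sketch of this criterion computed from an (n, k) matrix of posterior membership weights (my own helper, following the formula above):

```python
# Entropy criterion I(k) from posterior membership weights w (shape n x k).
import numpy as np

def entropy_criterion(w):
    n, k = w.shape
    terms = np.where(w > 0, w * np.log(w), 0.0)   # convention 0 * ln(0) = 0
    return 1.0 - terms.sum() / (n * np.log(1.0 / k))

# e.g. entropy_criterion(e_step_weights(Y, p, params)) after fitting the model
```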

40 AN INTERESTING CASE
Full covariance structure: A = [A_1, A_2]
Simple multivariate Poisson distribution (no mixture).
Parameters to be estimated: λ_1, ..., λ_m, arranged in a symmetric array whose off-diagonal elements are the covariances.
- Probability function too complicated
- Use of recursive relationships (problematic for 5 or more dimensions)
- Latent structure can be used to derive an EM
- EM does not need the calculation of the probability function

41 ONGOING RESEARCH
- EM algorithm for the general model
- Extension to the finite mixture model for clustering purposes
- Bayesian approach: the same data augmentation helps
- Generalization of the results so that they are applicable (and computationally feasible) for large dimensions
