Variable selection for model-based clustering of categorical data

1 Variable selection for model-based clustering of categorical data. Brendan Murphy. Wirtschaftsuniversität Wien Seminar. 1 / 44

2 Alzheimer Dataset Data were collected on early onset Alzheimer patient symptoms in St. James Hospital, Dublin. Two hundred and forty patients had six behavioural and psychological symptoms (Hallucination, Activity, Aggression, Agitation, Diurnal and Affective) recorded. The number of distinct groups of patients gives an idea of the number of subclasses or syndromes. Which symptoms distinguish the groups? Can some subset better distinguish the syndromes? Previous studies had difficulty determining whether two or three groups are more suitable to describe the data. 2 / 44

3 Back Pain Dataset A study to investigate the use of a mechanisms-based classification of musculoskeletal pain in clinical practice. The aim of the study was to assess the discriminative power of the taxonomy of pain into Nociceptive, Peripheral Neuropathic and Central Sensitization for low-back disorders. There are N = 464 patients who were assessed according to a list of 36 binary clinical indicators (Present / Absent). Some of the indicators carry the same information about the pain categories, so the interest here is in selecting a subset of the most relevant clinical criteria while partitioning the patients. Does the partition of the patients agree with the clinical taxonomy? 3 / 44

4 Clustering and Variable Selection The motivating examples show the need for: Clustering: Can we establish the existence of subgroups? How can we characterize these subgroups? Variable Selection: Can we use a subset of the variables to distinguish the subgroups? 4 / 44

5 Model-Based Clustering/Mixture Models Denote the N × M data matrix by X; the nth observation is denoted by X_n. Model-based clustering assumes that X_n arises from a finite mixture model. Assuming G classes (components), with τ_g the mixture weights,

p(x_n | τ, θ, G) = \sum_{g=1}^{G} τ_g p(x_n | θ_g),

where p(x_n | θ_g) is the component distribution. 5 / 44
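
A minimal sketch (not from the talk; names are illustrative) of evaluating this mixture density, assuming the component densities are supplied as a callable:

```python
def mixture_density(x, tau, component_density, thetas):
    """p(x | tau, theta, G) = sum_g tau_g * p(x | theta_g).

    tau: sequence of G mixture weights summing to one.
    thetas: list of G component parameter objects.
    component_density: callable returning p(x | theta_g)."""
    return float(sum(t * component_density(x, theta) for t, theta in zip(tau, thetas)))
```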

6 Latent Class Analysis (LCA) model Latent Class Analysis (LCA) is a model for clustering categorical data. Let X_n = (X_{n1}, X_{n2}, ..., X_{nM}), where X_{nm} takes a value from {1, 2, ..., C_m}. In LCA we assume that there is local independence between variables, so that if we knew X_n was in class g we could write its density as

p(x_n | θ_g) = \prod_{m=1}^{M} \prod_{c=1}^{C_m} θ_{gmc}^{I(x_{nm}=c)},

where {θ_{gm1}, ..., θ_{gmC_m}} gives the probabilities of observing the categories {1, ..., C_m} in variable m. The θ_g characterize and embody the differences between the groups. 6 / 44
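
A sketch of this component density, assuming integer-coded observations (x[m] in {0, ..., C_m - 1}) and a nested list theta_g with theta_g[m][c] the probability of category c for variable m in class g; the names are illustrative:

```python
import numpy as np

def lca_component_density(x, theta_g):
    """p(x | theta_g) = prod_m theta_{g,m,x_m} under local independence."""
    return float(np.prod([theta_g[m][x_m] for m, x_m in enumerate(x)]))
```

Combined with the mixture weights from the previous sketch, this gives the LCA likelihood of the next slide.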

7 Example: Alzheimer Dataset Result for a three-group model from the BayesLCA package (White & Murphy, 2014). (Figure: estimated item probabilities for each symptom in the three groups.) 7 / 44

8 LCA model (general) The model likelihood is of the form

p(x_n | θ, τ, G) = \sum_{g=1}^{G} τ_g \prod_{m=1}^{M} \prod_{c=1}^{C_m} θ_{gmc}^{I(x_{nm}=c)}.

It is more convenient to work with the completed data. Augment the data with class labels Z_n = (Z_{n1}, Z_{n2}, ..., Z_{nG}), where Z_{ng} = 1 if observation n belongs to group g and Z_{ng} = 0 otherwise. Then we can write down the completed-data likelihood for an observation:

p(x_n, z_n | θ, τ, G) = \prod_{g=1}^{G} \{ τ_g \prod_{m=1}^{M} \prod_{c=1}^{C_m} θ_{gmc}^{I(x_{nm}=c)} \}^{z_{ng}}. 8 / 44

9 LCA model (general) Estimation is by the EM algorithm or variational Bayes (see the BayesLCA package). Note that G must be chosen in advance; it is possible to discriminate the best G for the data using information criteria (e.g. BIC). Bayesian approaches: Pandolfi, Bartolucci and Friel (2014) use reversible jump MCMC to get a posterior probability for G. 9 / 44
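
For concreteness, a hedged sketch of one EM iteration for the LCA model (this is not the BayesLCA implementation; it assumes integer-coded data and a common number of categories C for all variables):

```python
import numpy as np

def lca_em_step(X, tau, theta):
    """One EM iteration for an LCA model (sketch).

    X: (N, M) integer-coded data with categories 0..C-1.
    tau: (G,) class weights.
    theta: (G, M, C) item probabilities."""
    N, M = X.shape
    G, _, C = theta.shape

    # E-step: responsibilities proportional to tau_g * prod_m theta[g, m, X[n, m]]
    log_resp = np.tile(np.log(tau), (N, 1))            # (N, G)
    for m in range(M):
        log_resp += np.log(theta[:, m, X[:, m]]).T     # add log theta[g, m, x_nm]
    log_resp -= log_resp.max(axis=1, keepdims=True)    # stabilise before exponentiating
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: update tau and theta from the expected counts
    tau_new = resp.mean(axis=0)
    theta_new = np.zeros_like(theta)
    for c in range(C):
        theta_new[:, :, c] = resp.T @ (X == c)         # expected count of category c per (g, m)
    theta_new /= resp.sum(axis=0)[:, None, None]
    return tau_new, theta_new, resp
```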

10 Bayesian variable selection in LCA model Consider the variables that are useful for clustering in the model. Let ν_cl be a vector containing the indexes of the set of variables used for clustering the data, and let ν_n contain the remaining indexes. This splits the observed categorical variables into those with discriminating power and those without. 10 / 44

11 Bayesian variable selection in LCA model Then for the variables used in clustering

p_cl(X, Z | θ, ν, τ, G) = \prod_{n=1}^{N} \prod_{g=1}^{G} \{ τ_g \prod_{m ∈ ν_cl} \prod_{c=1}^{C_m} θ_{gmc}^{I(x_{nm}=c)} \}^{z_{ng}}

... and for those not used

p_n(X | ρ, ν) = \prod_{n=1}^{N} \prod_{m ∈ ν_n} \prod_{c=1}^{C_m} ρ_{mc}^{I(x_{nm}=c)},

where ρ_mc is the probability of variable m having category c; it is the same for all observations, regardless of class. 11 / 44

12 Bayesian variable selection in LCA model Priors are Dirichlet on the item probabilities in each class for the discriminating variables, and also on the non-discriminating variables:

p(θ_gm | β) = \frac{Γ(C_m β)}{Γ(β)^{C_m}} \prod_{c=1}^{C_m} θ_{gmc}^{β-1},

p(ρ_m | β) = \frac{Γ(C_m β)}{Γ(β)^{C_m}} \prod_{c=1}^{C_m} ρ_{mc}^{β-1}.

The prior on the class probabilities is also Dirichlet:

p(τ | α, G) = \frac{Γ(Gα)}{Γ(α)^{G}} \prod_{g=1}^{G} τ_g^{α-1}. 12 / 44

13 Bayesian variable selection in LCA model We aim to explore uncertainty in the number of groups G and in the variables used for clustering. We take a prior on G also; we employ the prior of Nobile and Fearnside (2007), which was justified for this problem in a similar context:

p(G) \propto 1/G!, normalized over 1, ..., G_max.

In fact, the work we present here brings that of Nobile and Fearnside (2007) (for Gaussian mixtures) into the categorical data domain. 13 / 44

14 Bayesian variable selection in LCA model Variables are assumed to be included a priori following a Bernoulli with parameter π:

p(ν | π) = \prod_{m ∈ ν_cl} π \prod_{m ∈ ν_n} (1 - π).

Usually there will only be enough information to set something like π = 0.5 in a practical situation. Ley and Steel (2009) investigate putting a Beta(a_0, b_0) hyperprior on π; we tried this but found no notable difference in results. 14 / 44

15 Bayesian variable selection in LCA model If we write down the model in its full form, we get a joint posterior on the item probabilities over classes, the class probabilities, the labels and the number of classes. The full completed likelihood is

p_full(X, Z | θ, ρ, ν, τ, G) = p_cl(X, Z | θ, ν, τ, G) p_n(X | ρ, ν)

and the posterior is

p(G, Z, θ, ρ, ν, τ | X, α, π, β) \propto p_full(X, Z | θ, ρ, ν, τ, G) p(τ | α, G) p(ν | π) \prod_{m ∈ ν_n} p(ρ_m | β) \prod_{g=1}^{G} \prod_{m ∈ ν_cl} p(θ_gm | β) p(G). 15 / 44

16 Marginalization approach Using normalizing constants for the Dirichlet distribution, it turns out that

p(G, Z, ν | X, α, π, β) \propto p(G) \int p(Z, θ, ρ, ν, τ | X, G, α, π, β) dθ dρ dτ

is actually available in closed form. Instead of doing a trans-dimensional search like the reversible jump algorithm, why not search over the discrete space defined by (G, Z, ν)? 16 / 44

17 Marginalization approach Doing the algebra gives

p(G, Z, ν | X, α, π, β) \propto p(G) p(ν | π) \frac{Γ(Gα) \prod_{g=1}^{G} Γ(N_g + α)}{Γ(α)^{G} Γ(N + Gα)} \prod_{m ∈ ν_n} \frac{Γ(C_m β) \prod_{c=1}^{C_m} Γ(N_{mc} + β)}{Γ(β)^{C_m} Γ(N + C_m β)} \prod_{m ∈ ν_cl} \prod_{g=1}^{G} \frac{Γ(C_m β) \prod_{c=1}^{C_m} Γ(N_{gmc} + β)}{Γ(β)^{C_m} Γ(N_g + C_m β)},

where N_g is the number of observations clustered to group g, N_mc is the number of times variable m takes category c, and N_gmc is the number of observations in group g that have category c for variable m. 17 / 44
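
As a hedged sketch (the function and variable names are illustrative, not from the paper), the logarithm of this collapsed posterior can be evaluated up to an additive constant directly from the counts:

```python
import numpy as np
from scipy.special import gammaln

def log_collapsed_posterior(X, z, G, nu_cl, C, alpha, beta, pi):
    """log p(G, Z, nu | X, alpha, pi, beta) up to an additive constant (sketch).

    X: (N, M) integer-coded data; z: (N,) labels in 0..G-1;
    nu_cl: boolean (M,) indicator of clustering variables; C: (M,) categories per variable."""
    N, M = X.shape
    N_g = np.bincount(z, minlength=G)

    lp = -gammaln(G + 1)                                                 # p(G) propto 1/G!
    lp += nu_cl.sum() * np.log(pi) + (M - nu_cl.sum()) * np.log(1 - pi)  # p(nu | pi)
    lp += (gammaln(G * alpha) - G * gammaln(alpha)                       # class-weight term
           + gammaln(N_g + alpha).sum() - gammaln(N + G * alpha))

    for m in range(M):
        dir_const = gammaln(C[m] * beta) - C[m] * gammaln(beta)
        if nu_cl[m]:                                                     # clustering variable
            for g in range(G):
                N_gmc = np.bincount(X[z == g, m], minlength=C[m])
                lp += (dir_const + gammaln(N_gmc + beta).sum()
                       - gammaln(N_g[g] + C[m] * beta))
        else:                                                            # non-clustering variable
            N_mc = np.bincount(X[:, m], minlength=C[m])
            lp += dir_const + gammaln(N_mc + beta).sum() - gammaln(N + C[m] * beta)
    return lp
```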

18 MCMC sampling algorithm Class memberships are sampled using a Gibbs sampling step which exploits the full conditional distribution of the class label for observation n, n = 1, ..., N. A component is added or removed with probability 0.5. To add, a component k is chosen at random to eject a new component from; a draw u ~ Beta(a, a) is made, and each element of the ejecting component is assigned to the new component with probability u. Components are removed by putting the elements of two randomly drawn clusters into a single cluster. 18 / 44
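
A hedged sketch of the Gibbs step for a single label, treating the collapsed posterior sketched above as the unnormalized full conditional (a real implementation would update the counts incrementally rather than recomputing everything, and the eject/absorb moves for adding or removing a component are not shown):

```python
import numpy as np

def gibbs_update_label(n, X, z, G, nu_cl, C, alpha, beta, pi, rng):
    """Sample a new label for observation n from its full conditional (sketch).

    Relies on log_collapsed_posterior() from the previous sketch;
    rng is, e.g., np.random.default_rng()."""
    log_probs = np.empty(G)
    for g in range(G):
        z[n] = g                                        # tentatively assign label g
        log_probs[g] = log_collapsed_posterior(X, z, G, nu_cl, C, alpha, beta, pi)
    probs = np.exp(log_probs - log_probs.max())         # normalise on the log scale
    probs /= probs.sum()
    z[n] = rng.choice(G, p=probs)
    return z
```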

19 MCMC sampling algorithm To sample the clustering variables, a variable j is chosen randomly from {1, ..., M}. If j ∈ ν_n, it is proposed to move it to ν_cl; alternatively, if j ∈ ν_cl, it is proposed to move it to ν_n. The acceptance probability for inclusion in ν_cl is min(1, R) with

R = \frac{p(G, Z, ν* | X, α, π, β)}{p(G, Z, ν | X, α, π, β)} = \left( \frac{Γ(C_j β)}{Γ(β)^{C_j}} \right)^{G-1} \prod_{g=1}^{G} \frac{\prod_{c=1}^{C_j} Γ(N_{gjc} + β)}{Γ(N_g + C_j β)} \left( \frac{\prod_{c=1}^{C_j} Γ(N_{jc} + β)}{Γ(N + C_j β)} \right)^{-1} \frac{π}{1 - π},

where ν* denotes ν with variable j moved into ν_cl. 19 / 44
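
A sketch of log R using log-gamma functions (illustrative names; it accumulates G copies of the Γ(C_j β)/Γ(β)^{C_j} constant in the numerator and one in the denominator, which is equivalent to the (G - 1) power above):

```python
import numpy as np
from scipy.special import gammaln

def log_accept_ratio_include(j, X, z, G, C, beta, pi):
    """log R for proposing to move variable j into the clustering set (sketch)."""
    N = X.shape[0]
    dir_const = gammaln(C[j] * beta) - C[j] * gammaln(beta)

    # variable j treated as a clustering variable (proposed state)
    log_num = 0.0
    for g in range(G):
        in_g = (z == g)
        N_gjc = np.bincount(X[in_g, j], minlength=C[j])
        log_num += (dir_const + gammaln(N_gjc + beta).sum()
                    - gammaln(in_g.sum() + C[j] * beta))

    # variable j treated as a non-clustering variable (current state)
    N_jc = np.bincount(X[:, j], minlength=C[j])
    log_den = dir_const + gammaln(N_jc + beta).sum() - gammaln(N + C[j] * beta)

    return log_num - log_den + np.log(pi) - np.log(1.0 - pi)
```

The move is accepted with probability min(1, exp(log R)); the reverse proposal, moving j out of ν_cl, uses the reciprocal ratio.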

20 Label switching Because the LCA likelihood is invariant to relabelling of the components, we need to deal with the label switching problem. The reason is that p(G, Z, ν | X, α, π, β) = p(G, Z_δ, ν | X, α, π, β), where Z_δ denotes the indicator matrix obtained by applying any permutation δ of 1, ..., G to the columns of Z. We need to post-process the samples of labels to undo any label switching that may have occurred; this has to be done to get the posterior probability of cluster membership. 20 / 44
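
One common post-processing scheme (not necessarily the one used in this work) relabels each retained sample so that it agrees as closely as possible with a reference labelling, e.g. via the Hungarian algorithm on the cluster agreement matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_to_reference(z_sample, z_ref, G):
    """Permute the labels in z_sample to maximise agreement with z_ref (sketch)."""
    agreement = np.zeros((G, G), dtype=int)
    for g_ref in range(G):
        for g in range(G):
            agreement[g_ref, g] = np.sum((z_ref == g_ref) & (z_sample == g))
    row, col = linear_sum_assignment(-agreement)   # maximise total agreement
    perm = np.empty(G, dtype=int)
    perm[col] = row                                # sampled label -> reference label
    return perm[z_sample]
```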

21 Post-hoc parameter estimation Use the conditional expectation and variance formulae

E[A] = E[E[A | B]],    Var[A] = E[Var[A | B]] + Var[E[A | B]].

Let N_g^{(t)} := \sum_{n=1}^{N} Z_{ng}^{(t)} and S_{gmc}^{(t)} := \sum_{n=1}^{N} Z_{ng}^{(t)} I(X_{nm} = c). Then we can estimate the expected values

E[θ_{gmc} | X, β] = E[E[θ_{gmc} | X, Z, β]] ≈ \frac{1}{T} \sum_{t=1}^{T} E[θ_{gmc} | X, Z^{(t)}, β],

and similarly for the variance. 21 / 44

22 Post-hoc parameter estimation This leads to nice formulae to estimate the posterior mean and variance of the item probabilities:

E[θ_{gmc} | X, β] ≈ \frac{1}{T} \sum_{t=1}^{T} \frac{S_{gmc}^{(t)} + β}{N_g^{(t)} + C_m β},

Var[θ_{gmc} | X, β] ≈ \frac{1}{T} \sum_{t=1}^{T} \frac{(S_{gmc}^{(t)} + β)(N_g^{(t)} + (C_m - 1)β - S_{gmc}^{(t)})}{(N_g^{(t)} + C_m β)^2 (N_g^{(t)} + C_m β + 1)} + \frac{1}{T} \sum_{t=1}^{T} \left( \frac{S_{gmc}^{(t)} + β}{N_g^{(t)} + C_m β} \right)^2 - \left( \frac{1}{T} \sum_{t=1}^{T} \frac{S_{gmc}^{(t)} + β}{N_g^{(t)} + C_m β} \right)^2.

Similar formulae are available for τ_g. 22 / 44
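
A sketch of the posterior-mean formula (the variance estimate is analogous), averaging the conditional Dirichlet means over the retained label samples; names are illustrative:

```python
import numpy as np

def posterior_mean_theta(X, z_samples, G, C, beta):
    """Estimate E[theta_gmc | X, beta] as the average over retained samples of
    (S_gmc + beta) / (N_g + C_m * beta)  (sketch).

    z_samples: (T, N) array of retained, relabelled label vectors."""
    T = z_samples.shape[0]
    M = X.shape[1]
    est = np.zeros((G, M, max(C)))
    for z in z_samples:
        N_g = np.bincount(z, minlength=G)
        for g in range(G):
            for m in range(M):
                S_gmc = np.bincount(X[z == g, m], minlength=C[m])
                est[g, m, :C[m]] += (S_gmc + beta) / (N_g[g] + C[m] * beta)
    return est / T
```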

23 Alzheimer data The sampler was run for 100,000 iterations and thinned by subsampling every twentieth iterate. (Figure: trace of the log posterior over the iterations for the Alzheimer data.) 23 / 44

24 Alzheimer data (Figure: Alzheimer data results, plotted by variable and number of groups.) 24 / 44

25 Alzheimer data The posterior probability for the number of syndromes in early onset Alzheimer's, where p_G is the estimated posterior probability of G classes. (Table: p_2, p_3, p_4, p_5, p_6 under two settings for π: a fixed value of π and π ~ Beta(1, 1.5).) 25 / 44

26 Alzheimer data
Collapsed Gibbs sampler post-hoc estimates:
          Hallucination  Activity     Aggression   Agitation    Diurnal      Affective
Group 1   -    (0.03)    0.54 (0.06)  0.10 (0.04)  0.14 (0.06)  0.13 (0.05)  0.59 (0.08)
Group 2   -    (0.04)    0.80 (0.06)  0.40 (0.08)  0.64 (0.12)  0.39 (0.07)  0.94 (0.04)
Full model Gibbs sampler estimates:
          Hallucination  Activity     Aggression   Agitation    Diurnal      Affective
Group 1   -    (0.03)    0.54 (0.06)  0.11 (0.05)  0.14 (0.06)  0.14 (0.05)  0.59 (0.08)
Group 2   -    (0.04)    0.79 (0.07)  0.39 (0.08)  0.64 (0.12)  0.38 (0.07)  0.93 (0.07)
26 / 44

27 Back Pain Data (Figure: physiotherapy data results, plotted by variable and group. Table: estimated posterior probabilities p_7, p_8, p_9, p_10, p_11, ... of the number of classes.) 27 / 44

28 Back Pain Data / Clinical Taxonomy The clustering closely follows the clinical taxonomy, but with the taxonomy groups subdivided into subtypes. (Table: cross-tabulation of the seven estimated groups against the clinical taxonomy categories.) 28 / 44
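
The agreement between an estimated partition and the clinical taxonomy can be summarised by a cross-tabulation like the one above, or by an agreement index such as the adjusted Rand index used later in the talk. A sketch, assuming both labelings are available as integer arrays and using scikit-learn for the ARI:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def compare_partitions(cluster_labels, taxonomy_labels):
    """Cross-tabulate an estimated partition against the clinical taxonomy
    and report the adjusted Rand index (sketch)."""
    groups = np.unique(cluster_labels)
    classes = np.unique(taxonomy_labels)
    table = np.array([[np.sum((cluster_labels == g) & (taxonomy_labels == c))
                       for c in classes] for g in groups])
    return table, adjusted_rand_score(taxonomy_labels, cluster_labels)
```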

29 Local Independence (A Problem?) When analyzing the back pain data, we achieved very little data reduction. In fact, only one variable was labeled as non-clustering. An explanation for this is the local independence assumption in the model. Suppose we have two variables that are highly dependent and both exhibit clustering. The variable selection method will include both variables in the model, even if one variable contains no extra clustering information. 29 / 44

30 Dean & Raftery's Greedy Search Dean & Raftery (2010) proposed a greedy stepwise variable selection algorithm for LCA. The observation vector X_n is partitioned as X_n = (X_n^C, X_n^P, X_n^O), where X_n^C are the current clustering variables, X_n^P is proposed to be added to the clustering variables, and X_n^O are the other variables. 30 / 44

31 Dean & Raftery's Greedy Search Two competing models are compared (graphical models M_1 and M_2 over the latent class variable z and the variable blocks X^C, X^P, X^O). M_1 assumes that the proposed variable has clustering structure. M_2 assumes that the proposed variable has no clustering structure. This framework reduces the independence assumption of the previously described approach. 31 / 44

32 Novel Extension: Relaxing Independence Further It is unrealistic to assume that X_n^C and X_n^P are conditionally independent. We propose replacing M_2 with a different model (in the new M_2, the proposed variable X^P depends on the clustering variables X^C, through a subset X^R, rather than on z). M_1 assumes that the proposed variable has clustering structure. M_2 assumes that the proposed variable has no clustering structure beyond that explained by the clustering variables. 32 / 44

33 Stepwise Search Algorithm We propose a stepwise search algorithm to find an optimal set of variables for clustering. The algorithm involves the following steps: Add: Add a variable to the current clustering variables. Remove: Remove a variable from the current clustering variables. Swap: Swap a proposed variable with one already in the clustering variables. Model selection is implemented using BIC. 33 / 44
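
A hedged skeleton of the add/remove/swap search (not the authors' implementation); `score` stands in for the BIC-based comparison of the competing models M_1 and M_2 for a candidate set of clustering variables:

```python
def stepwise_search(variables, initial, score):
    """Greedy add/remove/swap search over sets of clustering variables (sketch).

    variables: iterable of all variable indices.
    initial: starting set of clustering variables.
    score: callable returning the criterion to maximise (e.g. a BIC-based score)."""
    current = set(initial)
    current_score = score(current)
    while True:
        # all single-move neighbours of the current set
        neighbours = [current | {v} for v in variables if v not in current]   # add
        neighbours += [current - {v} for v in current]                        # remove
        neighbours += [(current - {v}) | {w}                                  # swap
                       for v in current for w in variables if w not in current]
        scored = [(score(s), s) for s in neighbours]
        best_score, best_set = max(scored, key=lambda t: t[0])
        if best_score <= current_score:                                       # no improvement: stop
            return current, current_score
        current, current_score = best_set, best_score
```

In practice the number of latent classes is also chosen for each candidate model; here that choice is folded into the hypothetical `score` callback.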

34 Back Pain Data The proposed model was applied to the back pain data. (Table: number of latent classes, BIC and ARI for models using all 36 criteria versus selected subsets of criteria.) The new model achieves much greater data reduction. 34 / 44

35 Algorithm Run (Table: trace of the stepwise search over 27 iterations, recording for each iteration the proposed removal and swap, the BIC difference and the accept/reject decision. Removals were accepted at every iteration until the final two, and a minority of the swap proposals were accepted.) 35 / 44

36 Clustering / Clinical Taxonomy The clustering closely follows the clinical taxonomy. (Table: cross-tabulation of Classes 1-3 against the Nociceptive, Peripheral Neuropathic and Central Sensitization categories.) It is not unusual for patients diagnosed as Nociceptive to have Peripheral Neuropathic aspects to their back pain. 36 / 44

37 Clustering Variables The selected variables exhibit strong clustering across the three groups. (Figure: profiles of the selected criteria, Crit.2, Crit.5, Crit.8, Crit.9, Crit.13, Crit.15, Crit.19, Crit.26, Crit.28, Crit.33 and Crit.37, in each of the three classes.) 37 / 44

38 Chosen Variables with Descriptions The chosen variables have the following descriptions:
Crit.2: Pain associated to trauma, pathologic process or dysfunction.
Crit.5: Usually intermittent and sharp with movement/mechanical provocation.
Crit.8: Pain localized to the area of injury/dysfunction.
Crit.9: Pain referred in a dermatomal or cutaneous distribution.
Crit.13: Disproportionate, nonmechanical, unpredictable pattern of pain.
Crit.15: Pain in association with other dysesthesias.
Crit.19: Night pain/disturbed sleep.
Crit.26: Pain in association with high levels of functional disability.
Crit.28: Clear, consistent and proportionate pattern of pain.
Crit.33: Diffuse/nonanatomic areas of pain/tenderness on palpation.
Crit.37: Pain/symptom provocation on palpation of relevant neural tissues. 38 / 44

39 Discarded Variables Many of the discarded variables are related to the clustering variables. (Figure: relationship between the discarded criteria, Crit.1, Crit.3, Crit.4, Crit.6, Crit.7, Crit.10, Crit.11, Crit.12, Crit.14, Crit.16, Crit.18, Crit.20, Crit.22, Crit.23, Crit.24, Crit.25, Crit.27, Crit.29, Crit.30, Crit.31, Crit.32, Crit.34, Crit.35, Crit.36 and Crit.38, and the selected criteria.) These are not clustering variables because they don't exhibit clustering beyond what can be explained by the clustering variables. 39 / 44

40 Summary Model-based approaches to clustering and variable selection achieve excellent performance. The collapsed MCMC scheme explores the model space effectively. Removing independence assumptions in the model improves variable selection. Care is needed when interpreting the chosen/discarded variables. 40 / 44

41 Simulation 1 (Figure: graphical model for Simulation 1, with latent class variable z and variables X_1, ..., X_12.) 41 / 44

42 Simulation 1 Results (Figure: results for Simulation 1 across variables X_1, ..., X_12.) 42 / 44

43 Simulation 2 (Figure: graphical model for Simulation 2, with latent class variable z and variables X_1, ..., X_10.) 43 / 44

44 Simulation 2 Results (Figure: results for Simulation 2 across variables X_1, ..., X_10.) 44 / 44
