Applying cluster analysis to 2011 Census local authority data Kitty.Lymperopoulou@manchester.ac.uk SPSS User Group Conference, 10 November 2017
Outline Basic ideas of cluster analysis How to choose variables How to apply hierarchical and non-hierarchical clustering How to select the number of clusters Cluster profiling
What is cluster analysis? Cluster analysis partitions the sample into groups on the basis of attributes that make them similar. The groups are mutually exclusive so that they best represent distinct sets of observations within the sample. The aim is to minimise within-group variance and maximise between-group variance. Plotted geometrically, objects within clusters should be close together.
What is cluster analysis? Figure: a) Objects; b) Two clusters; c) Four clusters; d) Six clusters
Measuring similarity Similarity represents the degree of correspondence among objects across dimensions. Inter-object similarity is measured by the distance between pairs of objects. Euclidean distance: the straight-line distance between two points. Squared Euclidean distance: Euclidean distance without the square root. City-block (Manhattan) distance: the sum of the variables' absolute differences. Chebychev distance: the maximum absolute difference in values for any variable.
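The four distance measures can be checked outside SPSS. A minimal sketch in Python using SciPy; the two example vectors are made up purely for illustration:

```python
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

# straight-line distance: sqrt((1-4)^2 + (2-0)^2 + (3-3)^2) = sqrt(13)
euclidean = distance.euclidean(a, b)
# the same quantity without the square root: 9 + 4 + 0 = 13
sq_euclidean = distance.sqeuclidean(a, b)
# sum of absolute differences: 3 + 2 + 0 = 5
manhattan = distance.cityblock(a, b)
# largest absolute difference on any one variable: max(3, 2, 0) = 3
chebychev = distance.chebyshev(a, b)
```

Squared Euclidean distance is the default measure SPSS offers for several of the linkage methods discussed later.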
How do we form clusters? Hierarchical (agglomerative) methods start with each observation as a cluster then combine observations sequentially to form clusters until there is only one large cluster. Non-hierarchical methods assign observations into clusters once the number of clusters is specified so that each observation belongs to a cluster with the nearest mean. Combination of methods (e.g. Two-step cluster analysis).
Hierarchical and non-hierarchical CA Hierarchical method. Advantages: generates cluster solutions from 1 to n; can handle different types of variables. Disadvantages: long computation (agglomerative), so not suitable for large samples; removing outliers requires re-running the cluster analysis several times; rigid, because once observations are assigned to a cluster they cannot move to another. Non-hierarchical (K-Means) method. Advantages: provides clusters that satisfy an optimality criterion; less sensitive to outliers and irrelevant variables; quick computation for large datasets; allows objects to move from one cluster to another. Disadvantages: K must be defined in advance; cannot produce a range of solutions in one run; sensitive to the choice of initial cluster centres.
Which variables? Variable selection should be guided by theoretical, conceptual and practical considerations. Choose variables that best describe the similarity between objects and that are relevant to the research problem. The sample size should be large enough to adequately represent the groups. Start with a large number of variables and reduce to a subset of the most relevant ones, those most likely to lead to an optimal solution.
Which variables? Variables can be nominal, ordinal, scale or a combination. Variables should show sufficient variation. Variables should not exhibit very high correlation with other variables (unless unequal weighting is desired). Variables with outliers are problematic. Variables may need to be standardised if the range or scale of one variable is much larger than, or different from, the others. Variables may need preparation (e.g. cleaning data, removing outliers) and transformation for normalisation purposes (standardisation).
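Z-score standardisation, the same transformation SPSS applies when you save standardised values, is simply centring and scaling each variable. A small Python sketch with made-up numbers:

```python
import numpy as np

# one variable measured on a much larger scale than the others
x = np.array([10.0, 20.0, 30.0, 40.0])

# z-scores: subtract the mean, divide by the (sample) standard deviation
z = (x - x.mean()) / x.std(ddof=1)
# the standardised variable has mean 0 and standard deviation 1,
# so no single variable dominates the distance calculations
```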
Hierarchical clustering Single linkage (Nearest neighbour): shortest distance between objects. Pros: good for natural clusters that are not spherical or elliptical. Cons: poorly delineated cluster structures within the data can result in snake-like chains for clusters. Complete linkage (Furthest neighbour): largest distance between objects. Pros: produces similar-sized clusters. Cons: sensitive to outliers. Average linkage (Between groups): average distance between objects. Pros: robust; less affected by outliers; generates clusters with small within-cluster variation. Ward's method: sum of squares within clusters, summed over all variables. Pros: produces similar-sized clusters. Cons: sensitive to outliers; cannot use distance measures other than Squared Euclidean distance. Centroid linkage: distance between cluster centroids. Pros: robust; less sensitive to outliers. Cons: cannot use distance measures other than Squared Euclidean distance.
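The same linkage methods are available outside SPSS. A sketch using SciPy on two invented, well-separated groups of points (the data and the choice of Ward's method are illustrative only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two artificial groups of 10 points each, far apart in both dimensions
X = np.vstack([rng.normal(0.0, 1.0, (10, 2)),
               rng.normal(8.0, 1.0, (10, 2))])

# 'single', 'complete', 'average', 'ward' and 'centroid' correspond to
# the methods described above; Ward requires Euclidean-type distances
Z = linkage(X, method="ward")

# cut the tree to obtain a two-cluster solution
labels = fcluster(Z, t=2, criterion="maxclust")
```

With groups this well separated, every method recovers the two original groups; the pros and cons above only start to matter on messier data.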
How many clusters? There is no generally accepted procedure for determining the cluster solution. Cluster analysis will always produce a solution, but it may not be meaningful or relevant. Decide on the validity of the groupings, the theoretical justification and the practicality of the results (Can you interpret the results? Are they meaningful?). Use inter-cluster distances: the between-groups sum of squares or likelihood can be plotted against the number of clusters (a scree plot), and the dendrogram can be inspected.
Cluster Profiling Cluster profiling involves calculating the mean values for each cluster. The profiles are usually based on the cluster variables (those used to form the clusters). They can also involve external variables, e.g. demographic or socio-economic variables not included in the cluster analysis. Interpreting or labelling the clusters involves examining the cluster profiles and deciding whether they are meaningful. In geodemographics, examining the spatial distribution of clusters enhances understanding of the clusters (and helps with cluster naming).
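Profiling reduces to computing per-cluster means. A toy sketch in Python with pandas; the cluster labels and variable values are invented for illustration:

```python
import pandas as pd

# hypothetical output: a cluster label plus two of the census variables
df = pd.DataFrame({
    "cluster":    [1, 1, 2, 2, 2],
    "unemployed": [4.1, 3.9, 7.2, 6.8, 7.5],
    "no_car":     [18.0, 20.5, 35.1, 33.0, 36.4],
})

# the profile: mean value of each variable within each cluster
profile = df.groupby("cluster").mean()
```

External variables can be profiled the same way, by merging them onto the labelled cases before the `groupby`.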
Hierarchical cluster analysis in SPSS
Dataset description Variables drawn from the 2011 Census for 348 local authorities in England and Wales to examine area deprivation: Households with an occupancy rating of -1 or less; Households with no central heating; Lone parent households; Households with no car or van; People aged 16 to 74 with no qualifications; People aged 16 to 74 unemployed; People with LLTI (limiting long-term illness); Households in social housing; People who belong to an ethnic minority group other than White British; People aged 16 to 74 in semi-routine occupations; People aged 16 to 74 in routine occupations
To select a random sample go to Data > Select Cases
From the main menu of SPSS select Analyze > Classify > Hierarchical Cluster.
Select the variables in the Variable(s) box: Nocarorvan, Unemployed, PeoplewithLLTI, Socialhousing, Ethnicminorities, Routinesemiroutine. Next click on Plots. Specify how you wish your cases to be identified, e.g. an ID variable; here we choose LAname
In Plots tick Dendrogram. The next option is the Icicle plot; it is generally more difficult to interpret, so tick None.
Cluster Method. Between-groups linkage computes the average distance between all pairs of objects in two groups and combines the two groups that are closest. Measure. This is the method used for measuring the distance between clusters. Interval gives dissimilarity and similarity measures for interval data; Squared Euclidean distance uses the squared differences between the values for the cases. Transform Values. Select Z scores and By variable under Standardize
Ward's Method: all possible pairs of clusters are combined and the sum of the squared distances within each cluster is calculated. This is then summed over all clusters. The combination that gives the lowest sum of squares is chosen.
Select Save to save cluster memberships: None (the default) doesn't save the solution. Single Solution saves the cluster membership for one specified solution. Range of Solutions saves a membership variable for each solution in the range.
Vertical lines in the Dendrogram represent clusters that are joined together at each stage. The position of the line on the scale shows the distances at which clusters were joined.
Examine the agglomeration schedule to find the cluster solution. A large jump in the coefficients means two relatively dissimilar clusters were merged, so the solution just before the jump is a good candidate. Rewriting the agglomeration table with the change in coefficients:

No. of clusters | Coefficient (last step) | Coefficient (this step) | Change
2 | 294 | 191.8 | 102.1
3 | 191.8 | 132.3 | 59.4
4 | 132.3 | 107.7 | 24.5
5 | 107.7 | 86.7 | 21.0
6 | 86.7 | 73.8 | 12.8
7 | 73.8 | 61.2 | 12.5
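The change column can be computed directly from the schedule. A sketch in Python using the rounded coefficients shown above; small rounding differences from the slide's figures are expected:

```python
# agglomeration coefficient at the stage that left k clusters,
# taken (rounded) from the schedule above
coeff = {1: 294.0, 2: 191.8, 3: 132.3, 4: 107.7, 5: 86.7, 6: 73.8, 7: 61.2}

# change incurred by merging from k clusters down to k - 1:
# a large jump suggests stopping at k clusters
change = {k: coeff[k - 1] - coeff[k] for k in range(2, 8)}

# here the largest jump is between 3 and 2 clusters, and the changes
# level off from about 5 clusters onwards
```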
You can examine the cluster membership for each of the solutions. From the main menu of SPSS select Analyze > Descriptive Statistics > Frequencies. Select the cluster membership variables CLU5_2 CLU4_2 CLU3_2 CLU2_2 in Variable(s) and click OK.
Number of observations in each cluster. Is a cluster with 1 observation useful?
K-Means cluster analysis in SPSS
Analyze > Descriptive Statistics > Descriptives
Analyze > Classify > K-Means Cluster
Select the standardized variables into the Variables list. Label Cases by LAname. In Number of Clusters enter 5. In Method select Iterate and Classify
Save new variables showing Cluster Membership and Distance from cluster center for each object
The Iteration History shows the number of times SPSS passed through the data before finding stable clusters. After the sixth iteration there was no measurable change in the cluster centres, so SPSS took these as the final centres.
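The same kind of K-means run can be reproduced outside SPSS. A sketch using scikit-learn, with random numbers standing in for the standardised census variables (348 cases by 6 variables, all values invented):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# stand-in for the standardised census variables: 348 LAs, 6 variables
X = rng.normal(size=(348, 6))

# n_clusters plays the role of 'Number of Clusters'; the algorithm
# iterates until the cluster centres stop moving, as in Iterate and Classify
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
labels = km.labels_             # cluster membership for each case
centres = km.cluster_centers_   # final cluster centres
```

Fixing `random_state` pins the initial cluster centres, which matters because K-means is sensitive to them; `n_init=10` reruns the algorithm from ten starting configurations and keeps the best.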
Cluster 4: above-average unemployment, LLTI, social housing and routine occupations, but lower levels of people from ethnic minorities. Cluster 5: above-average ethnic minorities and people with no car or van, but below-average LLTI and routine occupations (London?). Cluster 1: above-average levels of unemployment and people from ethnic minorities. Cluster 2: below-average values on nearly all variables (affluence?). Cluster 3: a mixture of low and high values (socio-economic diversity?).
Which cluster or group of local authorities is the most deprived? Which cluster or group of local authorities is the least deprived? How would you describe the other three clusters?
The Cluster Membership table (first 35 cases shown here) shows the distance of each case from the cluster centre, indicating how typical the case is of its cluster.
Distances between final cluster centres and the number of cases within each cluster. Largest clusters are cluster 2 and cluster 4. Cluster 5 has only 6 cases.
https://www.cmist.manchester.ac.uk/study/short/introductory/cluster-analysis/