Applying cluster analysis to 2011 Census local authority data


Applying cluster analysis to 2011 Census local authority data. Kitty.Lymperopoulou@manchester.ac.uk. SPSS User Group Conference, 10 November 2017

Outline
Basic ideas of cluster analysis
How to choose variables
How to apply hierarchical and non-hierarchical clustering
How to select the number of clusters to be formed
Cluster profiling

What is cluster analysis? Cluster analysis partitions the sample into groups on the basis of attributes that make the objects similar. The groups are mutually exclusive, so that they best represent distinct sets of observations within the sample. The aim is to minimise within-group variance and maximise between-group variance: plotted geometrically, objects within the same cluster should be close together.

What is cluster analysis? [Figure: the same set of objects partitioned as (a) objects, (b) two clusters, (c) four clusters, (d) six clusters.]

Measuring similarity. Similarity represents the degree of correspondence among objects across dimensions. Inter-object similarity is measured by the distance between pairs of objects.
Euclidean distance: the straight-line distance between two objects.
Squared Euclidean distance: Euclidean distance without the square root.
City-block (Manhattan) distance: the sum of the variables' absolute differences.
Chebychev distance: the maximum absolute difference in values for any variable.
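The four distance measures above can be checked by hand; a minimal NumPy sketch (the two observations are made up for illustration):

```python
import numpy as np

# Two hypothetical observations measured on three (standardised) variables.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

diff = a - b
euclidean = np.sqrt(np.sum(diff ** 2))   # straight-line distance
squared_euclidean = np.sum(diff ** 2)    # Euclidean without the square root
manhattan = np.sum(np.abs(diff))         # city-block: sum of absolute differences
chebychev = np.max(np.abs(diff))         # largest absolute difference on any variable

print(round(euclidean, 3), squared_euclidean, manhattan, chebychev)
# → 3.606 13.0 5.0 3.0
```

Note that squared Euclidean distance (the default in SPSS hierarchical clustering) ranks pairs of objects in the same order as Euclidean distance, since squaring is monotonic.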

How do we form clusters? Hierarchical (agglomerative) methods start with each observation as its own cluster, then combine observations sequentially until only one large cluster remains. Non-hierarchical methods assign observations to clusters once the number of clusters is specified, so that each observation belongs to the cluster with the nearest mean. Combinations of the two also exist (e.g. two-step cluster analysis).

Hierarchical and non-hierarchical CA
Hierarchical method. Advantages: generates cluster solutions from 1 to n; can handle different types of variables. Disadvantages: long computation (agglomerative), so not suitable for large samples; removing outliers requires re-running the cluster analysis several times; rigid, since once observations are assigned to a cluster they cannot move to another.
Non-hierarchical (K-means) method. Advantages: provides clusters that satisfy an optimality criterion; less sensitive to outliers and irrelevant variables; quick computation time for large datasets; allows objects to move from one cluster to another. Disadvantages: K must be defined in advance; cannot obtain a range of solutions in a single run; sensitive to the choice of initial cluster centres.

Which variables? Variable selection should be guided by theoretical, conceptual and practical considerations. Choose variables that best describe the similarity between objects and are relevant to the research problem. The sample size should be large enough to adequately represent the groups. Start with a large number of variables and reduce them to the subset most likely to lead to the best solution.

Which variables? Variables can be nominal, ordinal, scale or a combination. Choose variables showing sufficient variation. Variables should not exhibit very high correlation with other variables (unless unequal weighting is desired). Variables with outliers are problematic. Variables may need to be standardised if the range or scale of one variable is much larger than, or different from, the others. Variables may need preparation (e.g. cleaning, removing outliers) and transformation for normalisation purposes (standardisation).
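The standardisation mentioned above is usually a z-score transformation, which puts variables measured on different scales on an equal footing before distances are computed. A minimal sketch with made-up values (ddof=1 gives the sample standard deviation, matching how SPSS computes z-scores):

```python
import numpy as np

# Hypothetical raw percentages for one census variable across five areas.
x = np.array([12.0, 45.0, 30.0, 22.0, 41.0])

# Z-scores: subtract the mean and divide by the (sample) standard deviation,
# so no single variable dominates the distance calculation.
z = (x - x.mean()) / x.std(ddof=1)

print(z.mean(), z.std(ddof=1))  # mean ≈ 0, standard deviation ≈ 1
```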

Hierarchical clustering
Single linkage (nearest neighbour): shortest distance between objects. Pros: good for natural clusters that are not spherical or elliptical. Cons: poorly delineated cluster structures within the data can result in snake-like chains of clusters.
Complete linkage (furthest neighbour): largest distance between objects. Pros: produces similar-sized clusters. Cons: sensitive to outliers.
Average linkage (between groups): average distance between objects. Pros: robust; less affected by outliers; generates clusters with small within-cluster variation.
Ward's method: sum of squares within clusters, summed over all variables. Pros: produces similar-sized clusters. Cons: sensitive to outliers; cannot use distance measures other than squared Euclidean distance.
Centroid linkage: distance between cluster centroids. Pros: robust; less sensitive to outliers. Cons: cannot use distance measures other than squared Euclidean distance.
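This session uses SPSS, but the same linkage options can be compared in a few lines with scipy on made-up two-dimensional data; the scipy method names below are assumed to map onto the SPSS options above ('ward' and 'centroid' expect Euclidean distances):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated hypothetical groups of ten observations each.
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(3, 0.5, size=(10, 2))])

# single/complete/average/ward/centroid correspond to the linkage
# methods described above.
for method in ["single", "complete", "average", "ward", "centroid"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

With groups this well separated, every linkage method recovers the same two clusters; the methods only disagree on messier data.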

How many clusters? There is no generally accepted procedure for determining the number of clusters. Cluster analysis will always produce a solution, but it may not be meaningful or relevant. Judge the validity of the groupings by theoretical justification and the practicality of the results (Can you interpret the clusters? Are they meaningful?). Use inter-cluster distances: the between-groups sum of squares or likelihood can be plotted against the number of clusters, and the dendrogram inspected for large jumps in the fusion distances.
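The "large jump" heuristic can be automated: compute the fusion distances from a hierarchical clustering and look for the biggest increase. A sketch on made-up one-dimensional data with three obvious groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Nine hypothetical areas on one standardised variable, in three clear groups.
X = np.array([[0.0], [0.1], [0.2], [4.0], [4.1], [4.2], [8.0], [8.1], [8.2]])

Z = linkage(X, method="ward")
heights = Z[:, 2]          # distance at which each successive merge happened
jumps = np.diff(heights)   # a big jump means dissimilar clusters were forced together
# Stop just before the biggest jump; the merges done so far imply the solution.
k = len(X) - (np.argmax(jumps) + 1)
print(k)  # → 3
```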

Cluster Profiling. Cluster profiling involves calculating the mean values for each cluster. The profiles are usually based on the clustering variables (those used to form the clusters), but can also involve external variables, e.g. demographic or socio-economic variables not included in the cluster analysis. Interpreting or labelling the clusters involves examining the cluster profiles and deciding whether they are meaningful. In geodemographics, examining the spatial distribution of the clusters enhances understanding of them (and helps with cluster naming).
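Once memberships are saved, profiling by cluster means is a one-liner; a sketch with hypothetical memberships and one made-up clustering variable, using pandas:

```python
import pandas as pd

# Hypothetical saved cluster memberships and one clustering variable.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "unemployed_pct": [9.0, 11.0, 4.0, 6.0, 7.0, 7.0],
})

# Mean of each clustering variable within each cluster; comparing these
# to the overall mean is what supports labelling the clusters.
profile = df.groupby("cluster")["unemployed_pct"].mean()
overall = df["unemployed_pct"].mean()
print(profile - overall)   # above/below-average profile per cluster
```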

Hierarchical cluster analysis in SPSS

Dataset description. Variables drawn from the 2011 Census for 348 local authorities in England and Wales to examine area deprivation:
Households with an occupancy rating of -1 or less
Households with no central heating
Lone parent households
Households with no car or van
People aged 16 to 74 with no qualifications
People aged 16 to 74 unemployed
People with LLTI (limiting long-term illness)
Households in social housing
People who belong to an ethnic minority group other than White British
People aged 16 to 74 in semi-routine occupations
People aged 16 to 74 in routine occupations

To select a random sample go to Data > Select Cases.

From the main menu of SPSS select Analyze > Classify > Hierarchical Cluster.

Select the variables into the Variable(s) box: Nocarorvan; Unemployed; PeoplewithLLTI; Socialhousing; Ethnicminorities; Routinesemiroutine. Next click on Plots. Specify how you wish your cases to be identified (e.g. ID number); here we choose LAname.

In Plots, tick Dendrogram. The next option is the icicle plot; it is generally more difficult to interpret, so tick None.

Cluster Method: Between-groups linkage computes the average distance between all pairs of objects in two clusters and combines the two clusters whose average distance is smallest. Measure: the method used for measuring the distance between objects; Interval offers dissimilarity and similarity measures for interval data, and Squared Euclidean distance sums the squared differences between the values for the cases. Transform Values: select Z scores and By variable under Standardize.

Ward's method: all possible pairs of clusters are combined and the sum of the squared distances within each cluster is calculated, then summed over all clusters. The combination that gives the lowest sum of squares is chosen.
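The quantity Ward's method minimises can be scored by hand; a minimal sketch for one hypothetical two-cluster partition of five values on a single variable:

```python
import numpy as np

# Five hypothetical values on one standardised variable, split into two clusters.
x = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
clusters = [x[:3], x[3:]]

# Squared distances to each cluster's own mean, summed within each cluster
# and then across clusters: the total Ward's method keeps as small as possible.
wss = sum(np.sum((c - c.mean()) ** 2) for c in clusters)
print(wss)  # → 4.0
```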

Select Save to save cluster memberships: None (the default) does not save the solution. Single Solution saves the cluster membership for one specified solution. Range of Solutions saves the cluster membership for each solution within a specified range, one variable per solution.

Vertical lines in the dendrogram represent clusters that are joined together at each stage. The position of a line on the scale shows the distance at which the clusters were joined.

Examine the agglomeration schedule to find the cluster solution. The solution just before a large jump in the coefficients indicates a good solution. The agglomeration schedule, rewritten with the change in coefficients:

No. of clusters  Coefficient (last step)  Coefficient (this step)  Change
2                294.0                    191.8                    102.1
3                191.8                    132.3                    59.4
4                132.3                    107.7                    24.5
5                107.7                    86.7                     21.0
6                86.7                     73.8                     12.8
7                73.8                     61.2                     12.5
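The Change column is simply the first difference of the coefficients, and can be reproduced from the schedule (coefficients taken from the worked example, listed here from the 7-cluster step up to the final merge; any small discrepancies with the slide's Change values come from rounding of the displayed coefficients):

```python
import numpy as np

# Agglomeration coefficients from the worked example, ordered from the
# 7-cluster step up to the final (1-cluster) merge.
coefficients = np.array([61.2, 73.8, 86.7, 107.7, 132.3, 191.8, 294.0])

# Increase at each merge; the biggest increases flag the steps where
# dissimilar clusters were forced together.
changes = np.diff(coefficients)
print(changes.round(1))
```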

You can examine the cluster membership for each of the solutions. From the main menu of SPSS select Analyze > Descriptive Statistics > Frequencies, select the cluster membership variables (CLU5_2, CLU4_2, CLU3_2, CLU2_2) into Variable(s), and click OK.

Number of observations in each cluster. Is a cluster with 1 observation useful?

K Means cluster analysis in SPSS

Analyze > Descriptive Statistics > Descriptives

Analyze > Classify > K-Means Cluster

Select the standardized variables into the Variables list. Label Cases by LAname. In Number of Clusters enter 5. In Method select Iterate and classify.
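Outside SPSS, the same K-means run can be sketched with scikit-learn; the data below are a random stand-in for the 348 standardised census observations (the values are made up, only the shapes match the session's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Random stand-in for the census data: 348 local authorities, 6 variables.
X = rng.normal(size=(348, 6))
X_std = StandardScaler().fit_transform(X)

# "Iterate and classify" is the usual K-means fit: pick K in advance,
# then alternate assigning cases and updating centres until stable.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_std)

print(np.bincount(km.labels_))  # number of observations in each cluster
print(km.n_iter_)               # iterations before the centres stabilised
```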

Save new variables showing Cluster Membership and Distance from cluster center for each object

The Iteration History shows how many times SPSS passed through the data before finding stable clusters. After the sixth iteration there was no measurable change in the cluster centres, so SPSS took these as the final centres.

Cluster 4: above-average unemployment, LLTI, social housing and routine occupations, but lower levels of people from ethnic minorities. Cluster 5: above-average ethnic minorities and people with no car or van, but below-average LLTI and routine occupations (London?). Cluster 1: above-average levels of unemployment and people from ethnic minorities. Cluster 2: below-average values on nearly all variables (affluence?). Cluster 3: a mixture of low and high values (socio-economic diversity?).

Which cluster or group of local authorities is the most deprived? Which cluster or group of local authorities is the least deprived? How would you describe the other three clusters?

The Cluster Membership table (first 35 cases shown here) shows the distance of each case from the cluster centre, indicating how typical the case is of its cluster.

Distances between final cluster centres and the number of cases within each cluster. Largest clusters are cluster 2 and cluster 4. Cluster 5 has only 6 cases.

https://www.cmist.manchester.ac.uk/study/short/introductory/cluster-analysis/