Surnames as Indicators of Cultural and Linguistic Regions in Europe.

James Cheshire, Pablo Mateos, Paul A. Longley
Department of Geography and Centre for Advanced Spatial Analysis, University College London.
james.cheshire@ucl.ac.uk, spatialanalysis.co.uk

KEYWORDS: Surnames, Europe, Clustering, Geodemographics, Lasker Distance.

1. Introduction

The study of names is a many-sided enterprise with great and exciting intellectual potential (Zelinsky, 1997). This is especially true of European surnames, where high linguistic and cultural diversity has produced a rich variety of surnames. Previously these have been subject to relatively little large-scale spatial analysis (see Colantonio et al. (2003) and Cheshire et al. (2009) for full reviews). This study seeks to establish the degree to which the spatial distributions of European surnames form recognisable regions when compared to well-known broad linguistic and cultural areas. The quantity of surnames and the geographic extent of the data are unprecedented in this field of research (Manni et al., 2005). The results provide a classification of 16 European countries that can be utilised in future research as a basis for hypothesis generation and smaller scale studies.

This study takes as a given that surnames vary over space and that these variations are culturally determined (Zelinsky, 1997). The focus here is methodological: it outlines an inductive approach to discovering and representing the regionalities in surname distribution that may exist across Europe. This approach utilises proven methods for the meaningful aggregation, hierarchical clustering and subsequent mapping of millions of surname locations obtained from telephone directories and censuses.

2. Methods

Surnames are commonly mapped individually or in groups according to a shared characteristic, such as patronymic 's' names in Wales.
These maps are appropriate for specific studies into a particular name or group but are inadequate for large-scale, generalised studies. Through their interest in surnames as an indication of the genetic relationships between groups of people, geneticists have devised the Coefficient of Isonymy, which provides a method of aggregating the information contained within the spatial locations of millions of surnames (Lasker, 1977). The Coefficient of Isonymy establishes the extent to which the same name (isonymy) occurs between the populations of two or more spatial units. It can be defined as:

$$R_{AB} = \frac{1}{2} \sum_{i} p_{iA} \, p_{iB} \qquad (1)$$

where $p_{iA}$ is the relative frequency of the $i$th surname in population A and $p_{iB}$ is the relative frequency of the $i$th surname in population B. A lack of similarity between very diverse populations will produce very small Coefficient of Isonymy values that are hard to interpret and handle computationally. The Lasker Distance (Rodriguez-Larralde et al., 1994) is an extension of the Coefficient of Isonymy that produces more useful results. It is simply defined as:

$$L_{AB} = -\ln\left(R_{AB}\right) \qquad (2)$$
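The two measures can be illustrated with a minimal Python sketch. It assumes that the Coefficient of Isonymy takes Lasker's (1977) form of half the summed products of the relative surname frequencies, and that the Lasker Distance is its negative natural logarithm. The function names and the toy surname lists are hypothetical and are not taken from the study's database or software.

```python
import math
from collections import Counter


def relative_frequencies(surnames):
    """Map each surname to its share of the population's surname list."""
    counts = Counter(surnames)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def isonymy(pop_a, pop_b):
    """Coefficient of Isonymy: half the sum of p_iA * p_iB (assumed form)."""
    freq_a = relative_frequencies(pop_a)
    freq_b = relative_frequencies(pop_b)
    # Surnames absent from either population contribute zero to the sum.
    shared = freq_a.keys() & freq_b.keys()
    return sum(freq_a[s] * freq_b[s] for s in shared) / 2


def lasker_distance(pop_a, pop_b):
    """Negative natural log of the Coefficient of Isonymy (assumed form).

    Returns infinity when the two populations share no surnames.
    """
    r = isonymy(pop_a, pop_b)
    return math.inf if r == 0 else -math.log(r)


# Hypothetical toy populations, one list entry per person.
region_a = ["Jones", "Jones", "Evans", "Smith"]
region_b = ["Jones", "Smith", "Smith", "Brown"]
region_c = ["Mueller", "Schmidt", "Smith", "Weber"]

print(lasker_distance(region_a, region_b))  # two shared surnames
print(lasker_distance(region_a, region_c))  # one shared surname, so larger
```

In this toy case regions A and B share two surnames and regions A and C only one, so the A to C distance is the larger of the two, in line with the reading of larger values as greater difference in surname composition.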

The Lasker Distance values between spatial units can be thought of as distance in surname space. Larger values between groups suggest greater difference, and smaller values greater similarity, in terms of surname composition. The results of the calculation can be treated as a dissimilarity matrix, which provides a convenient input for Ward's hierarchical clustering and multidimensional scaling (MDS). Space does not permit a full justification and in-depth explanation of these methods. An in-depth analysis of a variety of clustering methodologies in this context can be found in Cheshire et al. (2009).

Ward's (1963) grouping algorithm is a popular method of hierarchical agglomeration. The procedure forms hierarchical groups of mutually exclusive subsets in attribute space, each of which contains members of maximal similarity in terms of the specified characteristics (Ward, 1963). The algorithm begins by assigning the n initial number of observations to (n - 1) exclusive sets by considering the union of all possible [n(n - 1)/2] pairs for the functional relation that matches an objective function chosen by the investigator, and then proceeds by successive iteration (Ward, 1963).

As with other hierarchical classifications (see Gordon, 1987), the outcome of clustering can be visualised as a dendrogram that illustrates the relationship between each observation and the rest, where all of the observations are joined together at the trunk of the tree. Each time two observations are joined, a new node is introduced with branches to the joined observations, the length of which is known as the cophenetic distance. This indicates the strength of the relationship between the observations (Kleiweg et al., 2004). Joining the clustering outcome to the boundary data enables the allocations to be shown as a choropleth map.
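The agglomeration described above can be sketched in plain Python using the Lance-Williams update for Ward's criterion. This is an illustration under stated assumptions rather than the software used in the study: the update formula formally presumes Euclidean distances, its use on a Lasker Distance matrix is assumed here, and the function names and the four-unit toy matrix are hypothetical.

```python
import math


def pair(i, j):
    """Order a pair of cluster ids so it can be used as a dictionary key."""
    return (i, j) if i < j else (j, i)


def ward_linkage(dist):
    """Agglomerate with Ward's criterion via the Lance-Williams update.

    `dist` is a symmetric list-of-lists dissimilarity matrix. Returns one
    tuple per merge: (cluster_a, cluster_b, merge_height, new_size).
    Observations are ids 0..n-1; merged clusters get ids n, n+1, ...
    """
    n = len(dist)
    size = {i: 1 for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(size) > 1:
        # Join the two active clusters with the smallest dissimilarity.
        (a, b), height = min(d.items(), key=lambda item: item[1])
        others = [k for k in size if k not in (a, b)]
        for k in others:
            total = size[a] + size[b] + size[k]
            squared = (
                (size[a] + size[k]) * d[pair(a, k)] ** 2
                + (size[b] + size[k]) * d[pair(b, k)] ** 2
                - size[k] * height ** 2
            ) / total
            d[pair(k, next_id)] = math.sqrt(max(squared, 0.0))
        # Retire the two merged clusters from the working matrix.
        for k in others:
            del d[pair(a, k)]
            del d[pair(b, k)]
        del d[(a, b)]
        size[next_id] = size.pop(a) + size.pop(b)
        merges.append((a, b, height, size[next_id]))
        next_id += 1
    return merges


def cut_tree(merges, n, k):
    """Label each observation after applying only the first n - k merges."""
    members = {i: [i] for i in range(n)}
    for step, (a, b, _height, _size) in enumerate(merges[: n - k]):
        members[n + step] = members.pop(a) + members.pop(b)
    labels = [0] * n
    for label, group in enumerate(sorted(members.values(), key=min)):
        for i in group:
            labels[i] = label
    return labels


# Hypothetical distances between four spatial units: two similar pairs.
toy = [
    [0, 1, 6, 7],
    [1, 0, 6, 7],
    [6, 6, 0, 2],
    [7, 7, 2, 0],
]
merges = ward_linkage(toy)
print(merges)
print(cut_tree(merges, n=4, k=2))
```

The merge heights returned by `ward_linkage` play the role of the branch lengths in the dendrogram, and `cut_tree` corresponds to partitioning the dendrogram into a chosen number of clusters before joining the labels to boundary data.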
Inspection of the resulting dendrogram, with a view to allocating a number of clusters as close as possible to the number of input countries (16), informed the decision to map 18 clusters.

In addition to hierarchical clustering, MDS was used to provide an effective summary of the degree to which surnames registered in the same country are clustered in multidimensional space. MDS is a well-established method of reducing the dimensionality of a data set from an m x n matrix with a large value of n to a matrix with a very small value of n (Everitt et al., 2001). MDS is well suited to studies where the distance measures (in this case the Lasker Distance between areas) arise directly from prior analysis (Everitt et al., 2001). In this study reducing n to 3 provided the maximal data reduction whilst minimising the loss of information. For ease of plotting, a reduction of n to 2 is shown. The results from the former are shown on the conference poster.

3. Data

The data used in this study are a subset from the database created for UCL's World Names Profiler (www.publicprofiler.org/worldnames), which contains the surnames and approximate locations of approximately 300 million people from 26 countries. Analysed here are the 16 European countries, between them containing approximately 5,950,000 unique surnames. Two levels of geography (NUTS 1 and NUTS 2) are used in this study. A list of the 16 countries and the Nomenclature of Territorial Units for Statistics (NUTS) level of geography used in the analysis is provided in Table 1. In the UK, for example, NUTS 1 corresponds to the Government Office Regions (GOR), whilst counties correspond to NUTS 2. The variation in NUTS levels used by this research is due to a lack of data at NUTS 2 level for Serbia, Macedonia, the Netherlands and Norway. The use of NUTS 1 for the remaining countries was prompted by very low populations within many of the NUTS 2 spatial units for these areas.
In total the Lasker Distances between 763 spatial units were calculated. All the surname and location data were derived from publicly available telephone directories or national electoral registers from the 2000-2005 period. To our knowledge, no study of this kind has been completed on a continental scale before with this quantity of unique surnames.
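As an illustration of how a matrix of pairwise distances such as this can be reduced to two plotting dimensions, the MDS step described in Section 2 can be sketched as follows. The sketch assumes classical (Torgerson) MDS, since the variant used in the study is not stated, and uses simple power iteration in place of a library eigensolver. All names and the toy distances are hypothetical. Power iteration finds the eigenvalue of largest magnitude, so the sketch is only reliable when the leading eigenvalues of the centred matrix are positive, as they are for the Euclidean toy data here.

```python
import math


def power_iteration(matrix, iterations=500):
    """Return an estimate of the dominant eigenvalue and unit eigenvector."""
    n = len(matrix)
    vec = [math.sin(i + 1.0) for i in range(n)]  # fixed, generic start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    image = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    # The Rayleigh quotient keeps the sign of the eigenvalue.
    return sum(v * w for v, w in zip(vec, image)), vec


def classical_mds(dist, dims=2):
    """Embed a symmetric distance matrix in `dims` dimensions."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain inner products.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        eigenvalue, vec = power_iteration(b)
        if eigenvalue <= 1e-12:
            break  # no further positive eigenvalue to build an axis from
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][axis] = scale * vec[i]
        # Deflate so the next pass finds the next eigenvalue.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * vec[i] * vec[j]
    return coords


# Hypothetical distances: the corners of a 3 x 4 rectangle.
toy_distances = [
    [0, 3, 4, 5],
    [3, 0, 5, 4],
    [4, 5, 0, 3],
    [5, 4, 3, 0],
]
coords = classical_mds(toy_distances)
for point in coords:
    print(point)
```

Because the toy distances are exactly Euclidean in two dimensions, the recovered coordinates reproduce them up to rotation and reflection. With real surname distances some information is lost in the reduction, which is why the choice of the number of retained dimensions matters.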

4. Results and Discussion

The mean Lasker Distance between the 763 spatial units was 10.46, with a range of values between 1.66 and 19.68. The smallest distances often occurred between contiguous areas and the largest distances across international boundaries.

Table 1: A list of the countries, and their respective level of spatial granularity.

The exclusion of spatial information (such as contiguity constraints or distance weightings) from the hierarchical clustering makes the spatial uniformity of the clusters shown in Figure 1 especially impressive. Unlike Belgium, which has been divided into a northern area of Dutch surnames and a southern area of French surnames, the multilingual countries of Luxembourg and Switzerland have been given unique cluster allocations rather than being partitioned along their known linguistic divisions. This suggests that a greater degree of surname mixing has occurred between the different linguistic areas of these countries than occurs between Wales, Scotland and England, each of which has been assigned a separate cluster allocation. Although not present in the 18-cluster outcome, transitions in surname compositions are present along other linguistic boundaries, such as between Catalan areas of Spain and the rest of the country, but these require the dendrogram to be partitioned into a greater number of clusters, suggesting more subtle differences in surname composition between these areas. Of the Scandinavian countries, Norway and Denmark are more similar to each other than to Sweden. Italy's surname composition appears fragmented with a north/south split. The former includes Rome and Sardinia, whereas the latter has been split into two groups, with the province Basilicata and part of Sicily forming one cluster allocation and the rest of Sicily and Southern Italy forming the second.

Figure 1: A map of the 18 cluster allocations produced from the Ward's Hierarchical Clustering of Lasker Distances. Each allocation is represented as a unique pattern. Cophenetic distances between adjacent clusters can be large, as is the case between Poland and Germany, or relatively small, such as between England and Wales. Areas of no data are white.

The MDS plots in Figure 2 provide an effective means of gauging the similarity between spatial units within a country. From these it is clear that some countries have tighter distributions than others, and these can be characterised as having a composition of surnames that is unique in the context of Europe but uniform within their borders. Ireland, Poland, Norway and Germany appear the most tightly clustered. By contrast, countries where multiple languages are spoken, such as Switzerland, Luxembourg and Belgium, have more dispersed MDS distributions. In France, outliers include the more Germanic areas surrounding Alsace. Finally, although Serbia and Montenegro have been allocated a single cluster, in Figure 2 their points in multidimensional space appear relatively distant from each other. This may be a product of the large spatial units and the high relative difference between the three areas of Serbia and Montenegro and the rest of Europe.

It is recognised that naming conventions vary across Europe. It was felt unnecessary to account for this, as the purpose here is simply to identify areas of similarity or difference in surname compositions. For greater meaning to be attached to these results, such as genetic relatedness, conventions need to be accounted for in the initial Lasker Distance calculation.
The unsurprising nature of many of the surname regions highlighted (judged by their conformity to well-known national and linguistic boundaries) provides strong evidence that the inductive approach of this study, as demonstrated through its data and methods, is appropriate when attempting to establish the existence of regional patterns in Europe's surname distributions. A great deal more variation exists beyond the 18 groupings outlined above, and this will be the subject of future research. In addition, the data quality, spatial scale and extent can be improved through additional cleaning of the database, geocoding the available address data to NUTS 3 or finer levels of granularity, and obtaining data for the countries where it is lacking. The sensitivity of the analysis to the population sizes of the input spatial units also requires further investigation. This appears to be particularly important on an international level, where the spatial units and their population sizes vary. Finally, many interesting patterns and distributions emerge on a national level that are beyond the scope of this paper, but could be easily investigated with the data and methods demonstrated by this research.

Figure 2: Plots produced from the 2-dimensional MDS for each of the 16 countries. From top left the countries are: Norway (NO), Poland (PO), Serbia and Macedonia (SC), Sweden (SW), Ireland (IR), Italy (IT), Luxembourg (LU), Netherlands (NL), Denmark (DN), Spain (ES), France (FR), United Kingdom (GB), Austria (AU), Belgium (BE), Switzerland (CH), Germany (DE).

5. Acknowledgements

The authors would like to thank Muhammad Adnan and Maurizio Gibin for their work in assembling the database and linking it to the boundary data. This project was undertaken as part of James Cheshire's ESRC CASE PhD Studentship in collaboration with ESRI (UK). The reviewer's comments were gratefully received and we hope have been fully addressed in the improvements made to the original paper and its associated poster.

6. References

Cheshire, J., Mateos, P., Longley, P. 2009. Family Names as Indicators of Britain's Regional Geography. CASA Working Paper 149. Available from http://www.casa.ucl.ac.uk/publications/workingpapers.asp.

Colantonio, S., Lasker, G., Kaplan, B., Fuster, V. 2003. Use of Surname Models in Human Population Biology: A Review of Recent Developments. Human Biology. 75, 6: 785-787.

Everitt, B., Landau, S., Leese, M. 2001. Cluster Analysis 4th Edition. Hodder, London.

Gordon, A. 1987. A Review of Hierarchical Classification. Journal of the Royal Statistical Society. Series A (General). 150, 2: 119-137.

Kleiweg, P., Nerbonne, J., Bosveld, L. 2004. Geographic Projection of Cluster Composites. In Blackwell, A., Marriott, K., Shimojima, A. (eds.) Diagrams 2004, Lecture Notes in Computer Science. Springer, New York.

Lasker, G. 1977. A Coefficient of Relationship By Isonymy: A Method for Estimating the Genetic Relationship Between Populations. Human Biology. 49, 3: 489-493.

Manni, F., Toupance, B., Sabbagh, A., Heyer, E. 2005. New Method for Surname Studies of Ancient Patrilineal Population Structures, and Possible Application to Improvement of Y-Chromosome Sampling. American Journal of Physical Anthropology. 126: 214-228.

Rodriguez-Larralde, A., Pavesi, A., Siri, G., Barrai, I. 1994. Isonymy and the Genetic Structure of Sicily. Journal of Biosocial Science. 26: 9-24.

Ward, J. 1963. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association. 58, 301: 236-244.

Zelinsky, W. 1997. Along the Frontiers of Name Geography. Professional Geographer. 49, 4: 465-466.

7. Biographies

James Cheshire is halfway through his ESRC CASE PhD studentship (in collaboration with ESRI (UK)) in UCL's Department of Geography. His research focus is the spatial analysis of surnames and its applications. He is also a research assistant on the Wellcome Trust's People of the British Isles project. James's research can be followed at spatialanalysis.co.uk.

Paul Longley holds a chair in Geographic Information Science at UCL and acts as Deputy Director of CASA. His publications include twelve books and more than 125 refereed journal articles and contributions to edited collections. He is a Co-I on a node of the ESRC National Centre for e-Social Science and a co-editor of the journal Environment and Planning Series B.

Pablo Mateos is Lecturer in Human Geography in the Department of Geography at University College London (UCL). His research interests lie within Population and Urban Geography and his work focuses on investigating ethnicity, migration and socio-spatial inequalities in contemporary cities. PhD in Geography (UCL 2007); MSc in GIS and Human Geography (University of Leicester 2004).