AN ASSESSMENT OF NUMERCAL CLASSFCATORY METHODS FOR GEOGRAPHY SHARFAH MASTURA SYED ABDULLAH Umversiti Kebangsaan Malaysia SNOPSS Perkembangan teknik pelbagaz angkubah adalah selan dengan perkembangan dan kemudahan alat komputer. Kertas zm cuba menilaz kegunaankegunaan salah satu danpada teknik-teknik pelbagaz angkubah mz, iaztu, analisa perkelompokan, Juga dikenali sebagaz kelaszfikasz numerikal. Dalam membmcangkan kertas mz prosedur-prosedur teknik perkelompokan mz diterangkan. Sebagaz satu contoh untuk menjelaskan teknik zm beberapa sedimen yang diambil danpada 30 sungaz digunakan. 'Coefficzent Dzsszmilanty zaztu 'Euclidean Dzstance' digunakan sebagaz ukuran persamaan dan lima strategz penyatuan dipilih untuk menunjukkan berbagaz-bagaz keputusan. SYNOPSS The development of the multzvanate techmque has grown parallel wzth the development and ready availability of digztal computers. Th,s paper attempts to- assess the usefulness of one of these multzvanate.techmques, that zs, the cluster analyszs also commonly known as numencal ClaSSficatOn. n thzs paper, procedures m developmg clustenng techmque are described. As an illustraton of ths techmque, contmuous data of vanous sediments taken from 30 streams are used. Disslmilanty coefficent (the Euclidean distance) zs used as measure of szmilanty and five fuson strategzes are selected to produce vanous results..ntroduction: The main surge of computer-based studies began in 1961 Wth the widespread use of machnes produced n the BM series. The demand on computer time ls now doubling every two years. The main uses of computers n systems analysis appear to be studies in multivariate statistlcs, trend-surface decompostlon, computer graphlcs and slmulaton (Hagget, 1969: 497-520). Workers n the 1950's, USng conventional statstics such as mean test, varlance analysls, correlation and regresslon analysis, were the first to respond to the facilitles and resources opened up by the computer. Studies were qulckly extended to include multivariate methods such as pnncipal 57
component analyss, factor analyss and multiple descnmmant functions (King, 1969). Researchers struggling through desk calculators to an exhausting five variables equaton were suddenly confronted with powerful 'package' programmes WhCh could not only compute coefficents for dozens of variables but provided means of selecting optmal sets from the rack (Hagget, 1969). Something of the power of multivariate analyss may be seen m the geographcal use of pnncipal component analyss and factor analysis. These techniques have the capability of reducmg immense arrays of data to a series of coherent and interpretable dimensons of factor. The amount of mathematics mvolved S normally outsde the range of mechamcal calculators, hence computer programmes have to be used. t must be stressed that multvanate methods, mcluding numerical taxonomy, have been applied far more m the field of bology and ecology when compared to the earth sciences, especally geomorphology. n the last decades or so there has been a swift growth of taxonomc methods m ecology. Since the poneenng work of Sorenson m 1948 who employed both coefficient of associaton and cluster analyss, others have followed SUt: Goodall (1953: 39-63), Williams and Lamberts (1959: 427-445), Greg-Smith (1964), and Whittaker (1967: 207-264). Multivariate analysis in Geography is pcking up m popularty. GeographCal mvestgatons are often of a multvanate nature m that they seek to analyse many attributes measured at different localities. Three of these techmques, namely, factor analyss, cluster analyss and multple discriminant analyss are Wdely used. ThS paper attempts to discuss how one of the multivanate techniques, that is, the cluster analyss, can be applied in geography. n ths discusson procedures m developmg clustermg techmques are described. Contmuous data of varous sediments taken from 30 streams are used as an illustration. Dissimilanty Coefficent, the Euclidean distance, S used as measure of similarity and five fuson strateges are selected to produce vanous results. l t should be emphassed, however, that ths paper is mamly concerned Wth the methodology rather than the problem solvmg work of ths techmque. NUMERCAL CLASSFCATON Classification can be derived subjectively or objectively or by varymg degrees of objectivity. This paper attempts to assess the usefulness of a clustering techmque WhCh S actually a numencal classification. Clustenng techmques have been developed m response to the followmg problem: gven a sample of 'N' objects or mdividuals, each of WhCh S measured on each of 'P' varables.. devise a classification scheme for grouping the objects into 'g' classes (Event, 1974). n clustenng methods, l. The analyss of the data S by the use of CL 1907 computer of Sheffield Umversty. The package programme used S custan la which S a comprehensive suite of Fortran V programme developed by Wishart 1969-1970. 58
Numerical Clalllfication Direct Claulflcatlon, Nohierarchical Figure. Nonhierarchlcal H lerarchlca Monothetic DiyislYe Hierarchical 1 Polythetic AgglomeratiYe. Nearest neighbour (single linkage) 2. Furtherest neighbour ( complet. linkage) :3. Group average (ayerage linkage) 4. Centroid sorting!5. The median (cower's method) 6. Ward's method (error sums of square) 7. Lance and William (flexlbl. b.ta) 8. Mc. Quitty method A choice of classification strategies. t S conventional to arrange data m the form of a N x P matnx m WhCh N columns represent the indivduals and P rows represent variables. The data matnx. is then used to estimate the resemblance between pairs of mdividuals. The scores m a data matnx may be expressed m many ways and they depend on the nature of the variables. They may be presence/absence data, rank data, menstc data or contmous data. Cluster analyss or numencal classification S divded mto two'distmct types: the herarchcal or non-herarchcal. The former S equated wth the production of a dendrogram while the latter S often known as rectculate. The herarchcal type S divded mto two major groups or methods. The first one S the divsve herarchcal method which begms wth a complete set of entities. These entties will then be classified and divded progressvely mto smaller groups. The second group S tpe agglomerative herarchcal method which starts Wth a smgle entty t then consders WhCh other entity S most smilar to the chosen one m some defined sense, followed by a thrd entity WhCh S most smilar to the first and second entties and ths proceeds to build up a cluster of entites. At some pomt t may be decded that none of the remammg enttes S Smilar enough to be assocated wth the cluster bemg formed and a new cluster S mluated. MONOTHETC AND POLYTHETC METHODS: The terms "monotheuc" and 'polytheuc" have been assocated wth herarchcal classificaton. n monotheuc procedures every group at every stage S bemg defined by the presence or lack of specal attributes, or m other words, the umon S based on one attribute. n polytheuc procedures 59
2 43 2 22 1 99 1-77 1 55 1 32 'll 0 88 0 66 0 44 0 22 r-,.l...r--- nlff ~-::r==1 ::::r;='"':;:=~ ~nl~ 0~Y2H3-+24-1~7~11~9-+26-1~3~~H8~*14~2b5~23~4~2~9~2~1 +5t6~2~7~~~2±8~~~15~9t7t8~16f~~~ Figure 2: Polythetic agglomerative classification employing Euclidean distance similarity coefficient and displaying nearest neighbour sorting strategy the Unon S based on ail the attributes. The agglomerative monothetlc cannot exst but m a trival sence. At the other extreme the divsve poly the tic classification S computationally out of the question for most researchers and thus t has not been sufficently developed. The most popular classifications m use at present are the divsve monothetc and agglomerative polythetc. Agglomerative polythetc method S, however, histoncally older than the divisve mono the tc, denvmg at least from the work of Kulczynski m 1972. The agglomerative poly the tic will be dealt with in greater detail below as this method is comparatively more supenor and popular than the other methods. AGGLOMERATVE. POLYTHETC METHOD: 60 The first stage m agglomerative polythetc method S the calculation of
5 31 4 82 4 34 3 86 3 37 2 89 2 41 1'93 1-14 0 96,... l~nj~ r- - 0'48 o nl.12 2923324 17 2111 1412 1813 2519 266272228 3010 209 1578 r-- - ;-- Figure 3: Poly the tic agglomerative classification employing Euclidean distance similarity coefficient and displaying furtherest neighbour sorting strategy similarty matrx between two individuals using some kind of mdex of coefficient. ndex of similarity coefficient can be broadly divided into four groups; distance coefficient; association coefficent: correlation coefficent; and probablistic similarity coefficient. Distance coefficlent measures distances between individuals in a space defined in various ways, and the most familiar measurement is simple Euclidean distance. Distance coefficient is the converse of smilarity; it is in fact a measure of dissimilarity. t has great intellectual appeal to taxonomists as it S easiest to vsualise (Sneath & Sokal, 1973). Association coefficient may involve quantitative data. t can be applied to rank and contmuous data by sacrificing mformation. Correlation coefficient measures proportionality and independence between pairs of indivdual sectors. The product moment correlation coefficient is frequently applied 16 61
3 2 2 9 2 6 2 3 2-0 4 1 8'8 5 8 2 9 122932417111218 14132519 264232156272228 301510 2071 16. Figure 4: Poythetic agglomerative classification employing Euclidean distance Smilarity coefficient and displaying centrod method sorting strategy to contmuous data. Probabiistic coefficient is more recent and t mcludes mformaton-type statistics Whch measure the homogenety of the data by partitioning or sub-parttomng sets of indivduals (William & Dale, 1965: 35-68). The next stage is the sorting strategy of fuson. The sortmg strategy mvolves groupmg togeth~r of mdividuais to form a dendrogram. Eight types of sortmg strateges are recogmzed. They are nearest neghbour or smgle linkage, furtherest neghbour or complete linkage, centrod, medium, group average, Ward's, Mcquitty and the Lance and William flexible Beta ~ethod. The result of fuson S expressed in a dendrogram form. The distance away from the base line at which two mdividuais or groups Jomed is a direct expresson of theu degree of similarty. Hence the most Smilar 62
3 68 3 32 2 95 2 58 1 84 1 47 1 1 0 73 26 4 2J 5 28 226 27 309 1578 1110 20 Z5 FiQure 5: Poly the tic OOCJlomerative classification employlnq Euclidean distance Similarity coefficient and displaying group average sorting strategy groups are first jomed. The best fuson will depend on ts ability to form a good group structure. Figure 1 shows the stages of numencal classification techniques already described above. The differences between these anse essentally because of the different ways of definmg distance (or similanty) between an indivdual and group contammg several vanables or between groups of mdividuais. Test runs on agglomerative poly the tc techmques of numerical classificaton are given in Figures 2,3,4,5 and 6. The variables consist of 13 different sediment samples taken from 30 different streams. Euclidean distance coefficient S used for similanty measures and fusion strateges used are the nearest neighbout, furtherest neighbour, centrod, group average and Ward's method. Some methods do not produce satisfactory clusters. Usually, ths is due to chainmg effects as seen m Figure 2. Chaining S a tendency of clustenng 63
18-6 16-9 15-2 11-85 10-1 8-4 6-7 15 2822& 27 3078 n 1510 202293 24 17411 1412 Figure 6: Poythetic OCJglomerative classification employing Euclidean distance Similarity coefficient and displaying word's method sorting strategy together certam mdivduals at a relatvely low level and to mcorporate mdivduals mto exstmg clusters rather than to generate new clusters. Generally, chammg is consdered to be a defect. However,:Jardine and Sibson (1968 177-184) argue that nearest neghbour is the bestmathematcal method and pomt out that to treat chammg as a defect S msleading. Chammg, according to them, is smply a descnpton of what a method does and if one S lookmg for optmally connected clusters, and not for homogenous clusters, such a method may be useful. However, because of chammg, nearest neighbour may fail to resolve relatvely distinct clusters. Many of the densty reach techmques arose from attempts to correct the effect of chammg m the smgle linkage method (Eventt, 1974). Figures 3, 4 and 5 produce slightly better dendrograms than the nearest neghbour method. However, Wth the excepton of Figure 3, the mbalance m the two clusters S ObVOUS. But Figure 4 (centrod method) 64
shows some reversals whch reflect another weakness of a dendrogram. "Reversals effect" occurs when the computer pnnts one fuson whch S at a lower level than has already been drawn on the diagram. Figure 6 (Ward's method) produces dendrograms that are COnspCUOUS for ther symmetncal herarchcal structure even though the 'steps' or change n levels S average. The dendrogram dentifies two large groups. DStnCt smaller groups are clearly present. n concluson, the above discusson illustrates that geographcal data can be analysed USng agglomeratve poly the tc wth five selected sortng strateges. The Ward's method shows good structured dendrogram while the others show weaker, structured dendrograms Wth channg, reversals and some crowding. Thus, suffice to say that the choce on the sortng strategy does nfluence the structure of the dendrogram and, therefore, to a certan extent the results of the analyss. The most mportant factor S that the user must be aware that the choce showed suts the data that one S dealing Wth so that the results would easily be grasped and analysed. On the whole, the agglomeratve polythetic techmque m analysng geographcal data should be useful. Firstly, most geographcal data are multivanate m nature and ths suts well Wth the numencal classificaton or cluster analyss techmque whch deals only wth such data. Other uses of ths techmque relevant to analysmg and solvng geographcal problems nclude finding a true typology, model fittmg, predicton generatng and data reducton (Eventt 1974). Applicaton of the cluster analyss techmque can also and has been used n the other fields of socal SCences and humamties. For example, Buechley (1967 53-69) has made a cluster analyss of family names m different geographc localites. Wishart and Leach (1970: 90-99) have used several methods of clustenng n examnng the works of Plato so as to determme the probable chronology, and Crumm (1965: 350-362) has studied the votng behavour of legslators by average linkage clustenng. However, n applyng the cluster analyss techmque n the socal SCence, one major problem that has to be consdered S the queston of quantifymg vanables. Some data m the socal SCences are difficult to quantify and therefore would pose a problem to the data nput. Such a problem S least faced by the physcal SCences. n short, n muitivanate analyss, the cluster analyss method has applicatons n geography as well as n other fields of study REFERENCES Buechley, R.W "Charactenstc Name Sets of Spamsh Populaton." 1967 Names. 15: 53-69. Eventt, B. Cluster Analysls. SOCal SCence Research Council, London: 1974 HEB Ltd. Codall, D.W "Objectve Methods for the Classificaton of Vegetaton.. 1953 The Use of POStve nterspecific Correlaton." ] Bot. 1: 39-63. 65
Greg-Smth, P. (1964) Quantztatzve Plant Ecology. 2nd. Ed. London: 1964 Butterworth. Grumm, J.G. "The systematc analyss of blocs the study of legslatve 1965 behaviour." Western Politzcal Quart. 18: 350-362. Hagget, P. (1969) "On Geographcal Research a Computer 1969 EnVronment. Geogr. ]. 135: 497-520. Jardine, N. and Sibson, R. "The Constructon of HierarchC and Non- 1968 Hierarchic Classificaton." Computer ]. 11: 177-184. King L.J. Statzstzcal Analyszs zh Geography. New York: Englewood Cliff. 1969 Sneath, P.H.A. and Sokal, R.R. Numerzcal Taxonomy. San Francisco: 1973 Freeman. William, W.T. and Dale, M.B "Fundamental Problems Numencal 1965 Taxonomy." Advance Bot. Res. 2: 35-68. William, W.T. and Lambert, J.M. "Multvariate Methods Plant 1959 Ecology. V. Similanty Analysis and nformation Analysis." ]. Ecol. 54: 427-445. Wishart, D. and Leach, S.V. "A Multivariate Analyss of Platonc Prose 1970 rythm." Computer Stud. Humanztzes Verbal Behav. 3: 90-99. 66
CATATAN KEPADA PENULS GAYA PENULSAN AKADEMKA J ernal lmu Sains Kemasyarakatan dan Kemanusiaan Universiti Kebangsaan Malaysia Makalah Penulis hendaklah mengirimkan 3 salinan (termasuk satu salinan asal dan 2 salinan berxerox) bagl mempercepatkan proses penilalan makalah itu. Penulis hendaklah menuliskan nama serta tajuk makalah di atas kertas yang berasmgan. Tinggalkan ruang (margin) sekurang-kurangnya 1 inci, dan gunakan kertas berukuran quarto. PanJang makalah termasuk jadual, catatankakl, dan bibliografi mestilah udak melebihi 40 muka serta bertaip dengan menggunakan 3 langkau (tnple-space manuscnpt).' Makalah boleh ditulis dalam bahasa nggens atau bahasa Malaysa. Bagl makalah berbahasa nggeris ejaannya hendaklah disesuaikan dengan ejaan Oxford Dictionary, bagl makalah dalam bahasa Malaysa ejaan Kamus Dewan (ejaan baru) hendaklah dirujuk. Catatankaki Sebaik-baiknya catatankaki janganlah memasukkan bahan-bahan rujukan atau bibliografi. Catatankaki hanya mengandungt keteranganketerangan tambahan yang kurang sesuai untuk dimasukkan di. dalam teks. Catatankaki hendaklah ditunjukkan di dalam teks dengan menggunakan angka-angka 1,2,3 dan seterusnya, dan hendaklah ditalp pada muka yangberasmgan dan dilampukan pada bahaglan akhir teks, iaitu sebelum bahan rujukan. Penghargaan boleh dimasukkan sekcili dengan catatankakl mengikut susunan angka tadi. Rujukan/Bibliografi RUJukan hendaklah dinyatakan di dalam teks menurut slstem Harvard. Misalnya: (Dablan 1976:10), laitu ianya mengandungl nama pengarang, tahun dan muka. Sekuanya kita merujuk kepada seluruh makalah atau buku, udak perlu kita tuliskan mukanya. Sekiranya tulisan yang sarna dirujukkan, akan ditulis sekali lagl seperti di atas tadi, (Dahlan 1976:15). Jika dua atau lebih makalahlbuku yang dirujuk itu telah ditulis oleh pengarang yang sarna dan dalam tahun yang sarna, maka rujukan diterblto dibuat seperti: (Dablan 1976a:1O) dan (Dablan 1976b:2) dan sebagamya. Pecahan ke 'a', 'b', dan seterusnya akanberpandu kesusunan abjad berdasarkan huruf pertama pada tajuk-tajuk makalah/buku ltu. Satu senaral rujukan hendaklah disediakan di akhu makalah:senarai 67
m hendaklah ditulis berasmgan serta diatur menurut susunan abjad. Contohnya sepertl di bawah mi. Jemal Dahlan, H.M. "Malay Traditonal SOCety and a Colomal 1976 Encounter". Akademika, Bil. 9, m.s. 1-20. Keats, D.M., Keats, J.A., dan Mohammad Haji Yusuf 1976 "Attribution of Reasons for Religious Beliefs m Four Ethmc Groups". Akademika, Bil. 9, m.s. 21-28. Buku Kolko, James The Limits of Power, New York: 1972 Harper and Row Publishers. Oberg, Kalervo Culture Shock. ndianapolis: 1974 Bobbs-Merril Company. Makalah Dalam Buku Lazarus, Rchard S. "Cogmtve and Personality 1967 Factors Underlymg Threat and Copmg", dalam Mortmer H. Appley and Richard Trumbull, penyuntmg, n Psychological Stress, New York: Appleton - Century Crofts. SNOPSS Penulis hendaklah melengkapkan 2 sznopszs pendek, satu dalam bahasa Malaysza dan satu lagz dalam bahasa nggerzs. Proof Jika masa menglzinkan, penulis-penulis makalah akan menenma salinan proof untuk disemak bag kesafahan-kesalahan percetakan sahaja. Sekuanya masa mendesak dan tempat tmggal penulis terlalu jauh, salinan proof akan disemak oleh sdang pengarang sendiri. Sidang pengarang mempunyai hak untuk membuat sebarang "editorial revson" tanpa menghubung penulis. 68
A NOTE TO CONTRBUTORS AKADEMKA Journal of Humanztzes and Soczal Sczences Natzonal Unzverszty of Malaysza GUDE FOR CONTRBUTORS: ARTCLES Authors should send 3 copzes (the orzgznal and two photostat copzes) to expedite evaluatzon of thezr artzcles. Authors should wrzte thezr names as well as the tztles of thezr artzcles on a separate sheet of paper. Leave at least one-znch margzns and use quarto-szzed papers. Each artzcle, zncludzng tables, footnotes, and biblzography must not exceed 40 pages zn length and should be type-wrztten zn trzple spaczng. Artzcles may be wrztten ezther zn English or Bahasa Malaysza. SPellzngs zn the Englzsh-language artzcles' should conform to those zn the Oxford Dzctzonary and Bahasa Malaysza artzcles to those zn Kamus Dewan (new spellzng). Footnotes As far as possible, footnotes should not znclude references or bibliographzes. They should only conszst of additzonal znformatzon whzch are not suztable for zncluszon zn the text. Footnotes should be zndzcated zn the text by the use of numbers 1, 2, 3 and so on, and should be type wrztten on a separate sheet and attached at the end of the text zmmediately before the list of references. Acknowledgements may be zncluded together zn the footnotes zn numerzcal order Reference! Bibliography References shown zn the text should follow the Harvard System. For example: (Dahlan 19.76: 10), that zs, zt zncludes the author's name, year and page. Where reference zs made to the artzcle or book as a whole, page numbers need not be zndzcated. f the same artzcle or book zs subsequently referred to, zt zs necessary to repeat what has been wrztten before, that zs, (Dahlan 1976 : 15). f two or more artzcles! books referred to have been wrztten by the same author zn the same year, then the references S made thus: (Dahlan 1976a : 10) and (Dahlan 1976b 2) and so on. The subdivzszons 'a', 'b', etc. should be zn alphabetzcal order based on the first letter of the tztles of the artzcles! books. A list of references should be made available at the end of the artzcle. The list should be type wrztten separately and arranged zn alphabetcal order. 69
Some examples are shown below: Journal Dahlan, H.M. "Malay Traditzonal Soczety and a Colonzal Encounter", 1976 Akademika, No.9, p. 1-20. Keats, D.M. Keats, J.A., and Mohammad Haji Yusuf "Attributzon of 1976 Reasons for Religzous Beliefs zn Four Ethnzc Groups". Akademika, No.9, p. 21-28. Books Kolko, James The Lzmzts of Power. New York: Harper and Row Publishers. 1972 Oberg, Kalerov Culture Shock. ndianapolis: Bobbs-Merrill Company. 1974 Artzcles from Books Lazarus, Rzchard S. "Cognztzve and Personality Factors Underlyzng 1967 Threat and Copzng". n Mortzmer H. Appley and Rzchard Trumbull, editors, n Psychologzcal Stress, New York: Appleton Century Crofts. Synopszs Authors should complete two short synopszs, one zn English and the other zn Bahasa Malaysza. Proof Time permzttzng, authors shall recezve a copy of thezr pre-publicatzon artzcles to be proof read. However, zf tzme does not permzt, proof reading will be done by the Editorzal Board. The Editorzal Board of Akademika reserves the rzght of "edztorzal revzszon" without consultatzon wzth respectzve authors. All editorzal correspondence and artzcles for submisszon should be addressed to:- The Editor, Akademika, c/o Department of Politzcal Sczence, Unzversztz Kebangsaan Malaysza, Bangz, Selangor, MALAYSA. 70