

Syst. Zool., 30(4), 1981, pp. 459-490

COMPARING NUMERICAL TAXONOMIC STUDIES

F. JAMES ROHLF AND ROBERT R. SOKAL

Abstract

Rohlf, F. J.¹ (IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598) and R. R. Sokal (Department of Ecology and Evolution, State University of New York, Stony Brook, New York 11794) Comparing numerical taxonomic studies. Syst. Zool., 30:459-490. Recent proposals to measure the degree to which given taxometric methods meet goals defined by the three current schools of classification have led to quantitative comparisons of the methods. To aid in understanding such comparisons, a flow chart of taxonomic procedures is presented. Optimality tests are reviewed for each type of procedure. Possibly desirable properties of classifications include: the fit of a summary representation to a similarity matrix, stability, general utility, fit to a known cladistic relationship, and optimality criteria of numerical phylogenetic methods. We examine how they relate to the professed goals of the taxonomic schools and whether they can be used for comparative evaluations between these schools. Previous attempts at comparing numerical classifications are reexamined. Such comparisons have largely been made improperly. Published comparative tests of taxonomic congruence are based on inappropriate comparisons or were improperly executed and cannot furnish evidence on relative stability of phenetic, evolutionary, and phylogenetic classifications. Reports that claim to show that numerical phylogenetic classifications result in better fits to original similarity matrices than phenetic methods, and therefore retain distance information better than phenetic classifications, are shown to be misleading. In the first such study, the comparison was not relevant to the question asked. In all of these studies the results were biased in favor of phylogenetic methods by retaining redundant information during the computation of matrix correlations for the phylogenetic methods.
In two later studies based on ten taxonomic data sets, the comparisons for the phylogenetic methods were in terms of unrooted trees rather than hierarchic classifications. By limiting the reference OTU to OTU 1 in each data set, these studies obtained results that tended to favor the phylogenetic methods considerably more than if some other reference OTUs had been employed. Only in a few cases is there a significant increase in fit with the phylogenetic methods. Interpreted as classifications, UPGMA clustering of the original dissimilarity matrix gives the best fit in the majority of cases when compared with rooted trees (minimum length and least squares fitted). For these data, there is no evidence that classifications by any "phylogenetic" technique yield better summaries of phenetic information than UPGMA. A recent study of predictivity, while correctly designed, yielded complex results with no clear preference for any one school of taxonomy. Thus there is no current acceptable evidence that numerical phylogenetic methods yield classifications which contain more information than either phenetic or evolutionary ones. [Numerical taxonomy; classification; phenetics; phylogenetics; cladistics].

¹ Present address: Department of Ecology and Evolution, State University of New York, Stony Brook, New York.

In this paper, we examine the methods proposed for measuring the degree to which various numerical methods meet the differing goals proposed by the various classificatory schools. Are given numerical methods consonant with the stated goals of a given school of taxonomy, and can one develop criteria which measure how well these methods meet such goals? We also discuss the validity of using the same criteria to compare phenetic and cladistic approaches to taxonomy. We first describe the normal flow of procedures in a numerical taxonomic study.
Next we enumerate the types of criteria that have been used to evaluate the optimality of such procedures and discuss appropriate ways by which comparisons can be made between alternative taxonomic procedures. We suggest methods for improving the evaluation of numerical techniques and outline some principles for studies comparing numerical phenetic and phylogenetic techniques. Then we examine several published taxometric studies attempting to evaluate the relative merits of phenetic and cladistic classifications using numerical methods. Finally, we present some conclusions concerning the validity and results of such comparisons.

Controversy in systematics, which in the early 1970's seemed to have quieted down, has in the last three or four years come to life again with a vigor and intensity surpassing that of the spirited debates of the late 1950's and 1960's. We welcome the renewed challenge to systematists to re-examine the principles by which they operate. The larger philosophical issues dividing phenetic taxonomy, evolutionary systematics and Hennigian cladistics are beyond the scope of the present article. We hope to address them at a later time.

A FLOW CHART OF TAXONOMIC PROCEDURES

For purposes of the subsequent discussion it will be useful to formalize the customary flow of procedures in a numerical taxonomic study. We may do so by the scheme illustrated in Figure 1. The study starts with objects (specimens) or OTUs on which we have made a series of observations with respect to the characters that differentiate them. Let the objects or OTUs, each described by a vector of observations, be symbolized by the letter O. From these observation vectors we prepare a data matrix, conventionally symbolized by X, to represent the distribution of character states over the set of OTUs. Such a matrix is of dimensions n × t for n characters and t OTUs, respectively. A resemblance or similarity matrix S of dimensions t × t (between all pairs of OTUs) is frequently computed from the data matrix. Although a dissimilarity matrix, such as a distance matrix among OTUs, is formally complementary to a similarity matrix, we shall for convenience use the symbol S to stand for both types of matrices. Next we operate on a resemblance matrix S to obtain a summary of the relationships contained therein. Such summaries may take several forms.
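As a concrete illustration of the step from a data matrix X to a resemblance matrix S, the following minimal Python sketch computes a t × t dissimilarity matrix from an n × t data matrix. The coefficient used here (a mean absolute character difference), the function name `dissimilarity_matrix`, and the toy data are assumptions chosen for illustration only; the text does not prescribe any particular coefficient.

```python
def dissimilarity_matrix(X):
    """Return a t x t dissimilarity matrix from an n x t data matrix.

    X[k][i] is the state of character k in OTU i (rows are characters,
    columns are OTUs, following the n x t convention of the text).
    The coefficient is the mean absolute character difference, which is
    an illustrative assumption, not a recommendation.
    """
    n = len(X)        # number of characters
    t = len(X[0])     # number of OTUs
    S = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(i + 1, t):
            d = sum(abs(X[k][i] - X[k][j]) for k in range(n)) / n
            S[i][j] = S[j][i] = d   # the matrix is symmetric
    return S


# Hypothetical data: 3 characters (rows) scored on 4 OTUs (columns).
X = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 1],
]
S = dissimilarity_matrix(X)
for row in S:
    print(["%.3f" % value for value in row])
```

Because the result is a dissimilarity rather than a similarity matrix, it corresponds to the complementary use of the symbol S noted above.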
The most common summarization is as a dendrogram, which is a dendritic structure representing the taxonomic relationship among OTUs. We have designated these taxonomic structures with the symbol D. Examples of such dendrograms are phenograms, the common results of numerical phenetic studies; or cladograms, as in the results of a Camin-Sokal numerical cladistic procedure; or phylograms, in which branch lengths as well as branch sequences are considered, as in a rooted Wagner tree. A second class of methods for summarizing resemblance matrices is ordinations, such as principal components analysis or nonmetric multidimensional scaling. We shall symbolize such summaries by M. The construction of graph-theoretic trees, here symbolized by T, is another common representation. Note that trees in a graph-theoretical sense consist of a set of t vertices (OTUs) and t − 1 edges (internodes). These edges may or may not have associated with them lengths expressing the dissimilarity between vertices. As pointed out by Sneath and Sokal (1973:324), much of the taxonomic literature has employed incorrect terms for numerical phylogenetic constructs. Trees have often been called networks (e.g., Wagner network), directed trees have been called trees (e.g., Wagner tree), and so forth. We have retained the established terminology of graph theory in this account (see, for example, Busacker and Saaty, 1965, or Harary, 1969). In some methods intermediate vertices may be constructed so as to minimize the overall length of the tree. Such vertices, also known as Steiner points, correspond to hypothetical taxonomic units (HTUs).

A goal of taxonomic methodology is to arrive at a classification. The summaries of resemblance matrices mentioned above are not yet classifications. By classifications we mean partitions of the OTUs into hierarchic classes of OTUs with no overlapping at any given categorical level, that is, nested sets as in the familiar Linnean system (see also Eldredge and Cracraft, 1980:168ff.). The classificatory relationships among the OTUs in a classification are ultrametric regardless of the methods (e.g., phenetic or cladistic) by which the classification has been obtained. Such relationships are neither phenetic nor cladistic but can be defined for two OTUs i and j in terms of the rank of the lowest level taxon to which i and j both belong. Since this fundamental point is not as widely appreciated among systematists as it should be, and since the information on the nature of ultrametric relationships is scattered and some of it is couched in mathematical language unfamiliar to most biologists, we have furnished in Appendix A a brief account of what an ultrametric is and why the classificatory relationships among the taxa of a Linnean classification must indeed be ultrametric. We have indicated classifications by the symbol C.

FIG. 1. Scheme illustrating the customary flow of procedures in a numerical taxonomic study. Objects or OTUs (O) are described by a vector of observations and assembled as a data matrix (X). From this a resemblance or similarity matrix (S) is computed. By means of a procedure, such as a clustering algorithm, one obtains a dendrogram (D), here represented as a phenogram. Alternative ways of summarizing information contained in the similarity matrix are as ordinations (M) or graph-theoretically as trees (T). These methods are not classifications as generally understood but can be transformed into classifications (C) as shown. The arrows show the flow of the computational procedures. Thus some ordinations are obtained directly from the data matrix as shown by the arrow from X to M.

The various kinds of dendrograms can be easily transformed into classifications.
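The ultrametric nature of classificatory relationships can be made concrete with a small Python sketch. It assumes a hypothetical three-level hierarchy (family, genus, OTU) and measures the relationship between two OTUs by the number of levels one must ascend to reach the lowest taxon containing both. It then checks the ultrametric inequality d(i, k) ≤ max[d(i, j), d(j, k)] for every triple. The taxon names, the function names, and the choice of level counts as the distance scale are all illustrative assumptions.

```python
from itertools import permutations


def classificatory_distance(a, b):
    """Levels one must ascend to reach the lowest taxon shared by a and b.

    Each OTU is given as a tuple of taxon names ordered from the most
    inclusive rank down to the OTU itself.
    """
    shared = 0
    for taxon_a, taxon_b in zip(a, b):
        if taxon_a != taxon_b:
            break
        shared += 1
    return len(a) - shared


def is_ultrametric(d):
    """True if d[i][k] <= max(d[i][j], d[j][k]) holds for every triple."""
    t = len(d)
    return all(d[i][k] <= max(d[i][j], d[j][k])
               for i, j, k in permutations(range(t), 3))


# Hypothetical nested classification: (family, genus, OTU).
lineages = [
    ("Family1", "Genus1", "A"),
    ("Family1", "Genus1", "B"),
    ("Family1", "Genus2", "C"),
    ("Family2", "Genus3", "D"),
]
C = [[classificatory_distance(a, b) for b in lineages] for a in lineages]

# An arbitrary dissimilarity matrix, by contrast, need not be ultrametric.
not_ultrametric = [[0, 1, 3],
                   [1, 0, 1],
                   [3, 1, 0]]

print(C)
print(is_ultrametric(C), is_ultrametric(not_ultrametric))
```

Any nested, nonoverlapping hierarchy passes this check, whatever method produced it, which is the sense in which the relationships in a classification are neither phenetic nor cladistic.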
The partitioning of the OTUs is indicated by the furcations, and the relative levels of the categories are determined by the scale in the case of phenograms. For cladograms or phylograms we may employ algorithmic rules, such as ordering by absolute time or relative time, i.e., by furcations from the root. Such direct transformations are shown by the arrow D → C. The other summarizations of resemblance matrices, ordinations and trees, cannot be directly transformed into classifications. To turn an ordination of OTUs into a classification one requires a cluster analysis of the distances implied by the ordination. Strictly speaking, one thus goes through steps involving similarities, S, and dendrograms, D, again before arriving at the classification C. These extra steps are indicated by the ellipsis in the arrow leading from M to C in Figure 1. One way of transforming trees into classifications is to root them. This turns a tree into a dendrogram and is shown by arrows leading from T to C via D. Alternatively, there exist algorithms for directly converting an unrooted tree into a classification, as in converting a minimum spanning tree into a single linkage phenogram. This is shown by the direct arrow T → C. There are still other ways of summarizing similarities that cannot be directly transformed into classifications, such as Jardine-Sibson Bk-clusters (for k > 1; Jardine and Sibson, 1968), but we shall not deal with them in detail here.

The symbols just presented can be used to specify the types of operations generally performed in a numerical taxonomic study, as follows. An O → X procedure is usually termed coding, the process of turning an original assemblage O of descriptions of OTUs into a data matrix X. The computation of a pairwise similarity function is an X → S procedure, whereas a classificatory procedure transforming a similarity matrix into a classification is usually an S → D → C procedure, although it could proceed via T or M. In some cases, as in systematic immunology, there is no X matrix since the similarities, S, are obtained experimentally.
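An S → D → C procedure can be sketched in a few lines of Python. The example below assumes UPGMA (average linkage) as the clustering step, records the dendrogram as the matrix of levels at which each pair of OTUs is first joined, and then derives a partition by cutting at a chosen level. The function names `upgma_cophenetic` and `classes_at_level`, the toy dissimilarity matrix, and the cut levels are hypothetical; ties are broken arbitrarily, and no claim is made that this matches any particular published program.

```python
def upgma_cophenetic(S):
    """UPGMA clustering of a t x t dissimilarity matrix S.

    Returns the t x t matrix of fusion levels (the ultrametric implied
    by the resulting phenogram).
    """
    t = len(S)
    clusters = {i: [i] for i in range(t)}          # cluster id -> members
    dist = {(i, j): S[i][j] for i in range(t) for j in range(i + 1, t)}
    coph = [[0.0] * t for _ in range(t)]
    next_id = t
    while len(clusters) > 1:
        # Join the closest pair of current clusters.
        (a, b), level = min(dist.items(), key=lambda item: item[1])
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = level
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        # Size-weighted average gives the unweighted pair-group distance.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {pair: d for pair, d in dist.items()
                if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    return coph


def classes_at_level(coph, level):
    """Partition OTUs into classes joined at or below the given level."""
    classes = []
    for i in range(len(coph)):
        for members in classes:
            if coph[i][members[0]] <= level:
                members.append(i)
                break
        else:
            classes.append([i])
    return classes


# Hypothetical dissimilarities among 4 OTUs.
S = [[0, 2, 5, 9],
     [2, 0, 7, 11],
     [5, 7, 0, 10],
     [9, 11, 10, 0]]
coph = upgma_cophenetic(S)
print(coph)
print(classes_at_level(coph, 2))
print(classes_at_level(coph, 6))
```

Cutting the same phenogram at successively higher levels yields nested classes, which is how the scale of a phenogram determines the relative levels of the categories.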
The classificatory process can also operate directly on the data matrix (e.g., Hartigan, 1975) and thus becomes an X → D → C procedure (although there is usually an implied intermediate similarity matrix). Similarly, some ordination procedures such as principal coordinate analysis (Gower, 1966, 1967) can be designated as S → M procedures, whereas S → T would indicate a tree-forming procedure, such as an unrooted distance Wagner tree (Farris, 1972). Again, X → M (e.g., principal components) and X → T (e.g., unrooted Wagner tree) procedures are also employed. As already mentioned, there also exist classificatory procedures turning ordinations into classifications, M → D → C. The philosophical basis (phenetic or cladistic) of the procedure is irrelevant to the terminology.

In comparing various procedures for their effects on phenetic or cladistic classifications, we must be careful not to be misled by labels attached to methods or to persons applying these methods. Although certain methods have been associated with one or another school of taxonomy (thus UPGMA clustering is typically used by pheneticists, the Wagner distance procedure by cladists), one may in principle examine the results of such procedures in terms of the criteria for any system of classification, not necessarily one associated with the method. For example, Farris (1977, 1979a, b) has attempted to show that various of his procedures, which he identifies as phylogenetic techniques, yield more naturalness as defined by phenetic taxonomists than does UPGMA clustering, which he calls phenetic similarity clustering. The terms phenetic and cladistic should ideally be restricted to taxonomic relationships and to classifications based on these relationships, not to specific methods.
There is even less justification for applying these terms to a method based on the supposed philosophical orientation of the person who developed a given method, a tendency that has been noted in some of the more partisan contributions to the recent literature. We prefer to retain Sneath and Sokal's (1973:29) definitions of phenetic relationship as "similarity (or resemblance) based on a set of phenotypic characteristics of the objects or organisms under study" and of cladistic relationship as "a branching (and occasionally anastomosing) network of ancestor-descendant relationships." Thus phenetic relationships are all those based on similarity, whether this is overall similarity based on equally weighted characters or a similarity coefficient based on an unequal weighting of characters. Cladistic relationships by the definition we employ must involve evolutionary branching sequences. Thus we would interpret a synapomorphy scheme produced by a Hennigian cladist who does not proceed to make inferences about the genealogy of the OTUs being studied as a special kind of phenetic relationship. Classifications should analogously be characterized by the nature of the intended inferences. If a classification is intended to represent a similarity scheme it is phenetic; if it is intended to show evolutionary branching sequences it is cladistic. These terms apply equally to estimates of the true, underlying relationships of similarity or genealogy as well as to these "parametric" relationships themselves. Thus relationships based on a hypothesized evolutionary tree are considered cladistic, as are those based on the true cladogram in cases where this may be known. In some data sets, phenetic and cladistic classifications might be identical, yet it is important to distinguish the two separate bases of classification.

OPTIMALITY CRITERIA

We can now consider criteria of optimality for each of the classificatory steps just outlined. Such criteria for an O → X procedure should depend on the purpose of the procedure undertaken, the nature of the algorithm employed and our assumptions about the underlying evolutionary changes and morphogenetic pathways of the characters describing OTUs. Whether characters should be coded as binary or multistate will in part depend on the type of analysis we carry out.
But the type of character coding should be determined not only by the types of coefficients and classificatory processes that will be applied, but also by underlying assumptions about the evolutionary dynamics of the group being classified. Do we conceive of character state change as changes of discrete units forming a linear or branched series, or is it more useful to code character states as class marks along a continuous spectrum of variations? Although various schemes for character coding have been proposed, there has been no definition of optimality criteria for this procedure, nor a comparative study evaluating techniques such as additive and nonadditive binary coding, gap coding, and related procedures (see, however, the discussion of information and unit characters in Sneath, 1957, and Sneath and Sokal, 1973:72).

The choice of an optimality criterion for an X → S process, the computation of similarity coefficients, depends on classificatory philosophy. A phenetic classification will require a definition of similarity and the construction of an index of similarity that best reflects that definition and the nature of the character coding undertaken. Hennigian cladistics will require a particular kind of similarity, namely similarity with respect to uniquely derived character states. If these are known, as in a model phylogeny, then a similarity coefficient can be so devised as to bring out just such special similarities. If, as in virtually all real situations, we do not know the true phylogeny of the group, similarity coefficients have been devised that estimate such a similarity by making various assumptions.

Optimality of an S → D → C process, a clustering and classificatory procedure, has frequently been tested by measuring the agreement between the similarity matrix S and the implied ultrametric of the classification C.
This can be done by any of the various measures of comparison such as the cophenetic correlation coefficient or Jardine and Sibson's (1972) coefficient A L. Optimality of classification C may also be tested by an X → C procedure where the partition imposed upon the data matrix by a given classification C can be evaluated in several ways, such as in terms of the overall homogeneity of character states (in some phenetic models), or as the most parsimonious hypothesis in terms of minimum length of character state trees (in many cladistic models), or by maximum predictivity only for derived patristic characters (in some cladistic models).

Optimality of S → T or X → T procedures, such as fitting minimum spanning trees or unrooted Wagner trees to data, has been tested in analogous ways. The optimality of these particular trees is usually specified in terms of their length. It is also possible, however, to evaluate the amount of distortion in the implied similarity between OTUs by using the matrix correlation technique. For the various probability models, the optimality of a given tree can be expressed in terms of its likelihood given some probabilistic model.

The optimality of ordination techniques, S → M or X → M procedures, is most often measured in terms of percentage of the variance explained (the parameter optimized in principal components analysis and principal coordinates analysis) or stress (the parameter optimized in nonmetric multidimensional scaling analysis). It is of course also possible to compute matrix correlations between the distances implied by the ordination analysis and the original similarity matrix (Rohlf, 1972).

It is somewhat less clear how one should best compare summarizations D, C, T, and M to determine which of them is an optimal representation of the data X or of similarities S. Comparing the original S matrix with similarities or dissimilarities implied by the summary representation could determine which class of techniques gave the minimum distortion.
Yet the problem in making such a comparison is that the S → D, S → C, S → T, and S → M procedures involve different numbers of parameters, so that it is not obvious whether one should, for example, compare a cluster analysis with a one-, two-, or three-dimensional ordination technique.

The types of comparisons of methods of numerical classification are illustrated in Figure 2. Differences in methods can be for O → X, X → S, S → D, or D → C procedures as shown in the second, third, fourth or fifth diagram in Figure 2, respectively. Conventional experimental design would suggest that procedures previous to the comparison, as well as those that follow, be identical so that any differences in the results can be ascribed to differences in operations at the single step. However, this will not always be possible since in some cases the introduction of an alternative procedure at a given step necessitates changes at subsequent steps as well. Thus two different methods of coding as shown in the second arrow diagram of Figure 2, as for example binary and interval measure coding, might require different similarity coefficients so that the X → S processes would necessarily be different. Similar considerations hold for ordinations and the other procedures.

DESIRABLE PROPERTIES OF CLASSIFICATION

A variety of tests and criteria can be applied to measure the optimality of the several stages of the taxonomic process as discussed above. We list the most important ones below and briefly discuss some of their properties. We shall note whether these tests are limited to one or the other school of taxonomy and, if not, whether these criteria can be used to compare the relative merits of the schools as such or of various methods.

The Fit of a Summary Representation to a Similarity Matrix

Such fits for C, D, T, or M are frequently tested by matrix correlations.
This is a deliberately vague term referring to correlations between any pair composed of the following: an original resemblance matrix among OTUs, a resemblance matrix computed from a classification, a tree, an ordination, or any other method of summarizing relationships among OTUs. Thus matrix correlations can be S1-S2, S-C, C1-C2, C-T, C-M, etc. measures; when applied to an S-C relationship they are known as cophenetic correlations.

FIG. 2. Comparing methods of numerical classification. The uppermost diagram illustrates the flow of taxonomic procedures as illustrated in Figure 1 in simplified form. The diagram immediately below shows alternative methods of coding (an O → X procedure), with other procedures kept constant. The next two diagrams illustrate alternative methods of computing similarity coefficients and dendrograms, whereas the last diagram illustrates alternative classificatory procedures. The subscripts refer to the alternative procedures. Note that in a case such as the second diagram, only the O → X procedure (coding) differs between the two classifications C1 and C2, as indicated by the dashed arrow from O to X2. The similarity coefficient and classificatory procedure used are identical, but may yield different results, here indicated by the differing subscripts, because of the change in coding.

Although product-moment correlation coefficients have been widely used to compare matrices and despite having developed this approach (Sokal and Rohlf, 1962), we are not convinced that it is necessarily the best way to go about establishing the desired correspondences. Various metric and nonmetric techniques may well be more suited to the tasks and may avoid some of the undesirable aspects of matrix correlations. But since much of the work carried out to date is based on matrix correlations, we need to examine their implications when they are applied in relevant tests.
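A matrix correlation of the product-moment kind can be sketched as follows. This minimal Python example correlates the corresponding entries of two symmetric matrices, using each pair of OTUs once (the entries above the diagonal). The function name `matrix_correlation` and both matrices are hypothetical; the second matrix is an ultrametric worked out by hand as an average-linkage summary of the first, so the result is an S-C (cophenetic) correlation under those assumptions.

```python
from math import sqrt


def matrix_correlation(A, B):
    """Product-moment correlation between corresponding entries of two
    symmetric t x t matrices, using each pair i < j exactly once."""
    t = len(A)
    a = [A[i][j] for i in range(t) for j in range(i + 1, t)]
    b = [B[i][j] for i in range(t) for j in range(i + 1, t)]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical dissimilarity matrix S among 4 OTUs.
S = [[0, 2, 5, 9],
     [2, 0, 7, 11],
     [5, 7, 0, 10],
     [9, 11, 10, 0]]

# Ultrametric matrix C implied by a hierarchic summary of S
# (hand-computed for this illustration).
C = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]

r_sc = matrix_correlation(S, C)
print(round(r_sc, 4))
```

The same function applies unchanged to S1-S2 or C1-C2 comparisons; only the interpretation of the resulting coefficient differs.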
A cophenetic correlation is an estimate of the strength of the relationship between an ultrametric representation of the resemblance among the OTUs implied by a classification C, and the original similarity matrix S. A cophenetic correlation is a criterion of optimality only to the degree that it measures how well a hierarchic arrangement of OTUs corresponds to the original similarity matrix from which it has been derived. However, S1-S2 and C1-C2 measures estimate different relationships. The S1-S2 measures estimate how well two similarity matrices correspond to each other; the C1-C2 measures estimate the same for two classifications (ultrametrics). Inherently neither is a measure of any type of optimality, although they may be useful in tests of congruence or of nonspecificity.

Cophenetic correlations have been routinely employed in phenetic taxonomy. The more a classification reflects the similarity matrix on which it is based, the more inherently desirable it would seem. Thus in phenetics a method that maximizes cophenetic correlations or other measures of fit will be optimal by an important criterion. A still better fit to the original similarity matrix can always be found if one is willing to use more parameters to describe the taxonomic structure and to sacrifice the ultrametric structure isomorphic to a Linnean classification. Thus, as we shall point out below, some of the techniques of numerical cladistics may yield better fits to similarity matrices than standard phenetic clustering methods. Yet when such findings have been reported (Farris, 1977, 1979a, b) the "cladistic" summary representations employed usually have not been ultrametrics and thus each does not correspond to a unique classification, i.e., the reported improvements were for S-T over S-C measures rather than for one S-C measure over another. Since the aim of a strictly cladistic classification is not primarily to reflect similarities between pairs of OTUs but rather to estimate genealogical relations, matrix correlations are not especially appropriate measures for comparing a cladogram with its data or similarity matrix. As an alternative to matrix correlations, one may fit the resulting classification to the original data matrix as suggested by Duncan and Estabrook (1976), effectively employing an X-C measure.

Stability

We believe that taxonomists would generally agree that, if all other considerations were equal, methods yielding more stable classifications should be preferred over those leading to less stable ones. However, stability is not a good optimality criterion by itself since one easily achieves a perfectly stable classification by simply not making it depend upon the data (an alphabetical arrangement does not change no matter what characters are used to describe the OTUs). Rohlf and Sokal (1980) have considered the concept of stability with respect to classifications. The following three aspects are of interest here.

One kind of stability is the robustness of a classification to the addition of new characters or to different selections of characters. Methods that yield more congruent classifications will be considered more stable.
By congruence we mean agreement of separate classifications arrived at by the same algorithms (phenetic or cladistic) and based on the same set of OTUs but on different sets of characters. If the different sets of characters are somehow randomly chosen, then such a measure of congruence reflects an aspect of the stability of the method. If, by contrast, the sets of characters represent different classes of characters, such as external vs. internal characters, male vs. female characters, biochemical vs. morphological characters and the like, then the test of stability is confounded with a test of the congruence of classifications based on these different kinds of characters (Rohlf and Sokal, 1980). In turn such congruence is related to the nonspecificity hypothesis (Sneath and Sokal, 1973:97). For randomly chosen characters, comparison of phenetics with cladistics based on congruences of the same character sets is legitimate. It would test the sampling error from the universe of characters. This error presumably should diminish as the sample size of each set of characters increases. We would predict that the phenetic methods will be more congruent with randomly chosen characters, although not by much, since they are sampled from the entire phenome rather than a subset representing shared derived similarity (synapomorphy). By contrast, when different classes of characters are employed, phenetic techniques should not reach perfect congruence even when large numbers of characters are employed for each class. This is so because the different classes of characters frequently reflect different adaptations. Ideally, cladistic techniques should yield fully congruent cladograms since there is only one true cladogram for a given set of taxa regardless of the set of characters on which it is based.
However, a cladogram constructed from a given set of characters is only an estimate of the true cladogram and is subject to errors due to the sampling of characters employed, errors in the determination of their states, and to any defects in the algorithm used to construct the estimated cladogram. Mickevich (1978) found in practice that cladograms constructed from different sets of characters differed appreciably.

Robustness to the addition of OTUs is a second kind of stability. One hopes to establish classifications whose essential structure would not change upon discovery and incorporation of a few new OTUs. While in principle this should be possible for ordinations (new OTUs are simply new points placed into the space), even in such cases the 2- or 3-dimensional reduced space is usually altered somewhat as new OTUs are added. Certain clustering algorithms are especially sensitive to the addition of new points to the study. The branching patterns of minimum length trees are similarly sensitive to the addition of OTUs. These problems are compounded when the added OTUs are intermediate between what were distinct groups. There have been few studies on the effects of adding new OTUs to classifications (but see Crovello, 1968). Tests of robustness to addition (or deletion) of OTUs require measures for comparing classifications that allow for absence of complete overlap between the sets of objects being compared. This aspect of stability can be tested in phenetic as well as cladistic classifications and it would seem that valid comparisons between the taxonomic schools could be made for this criterion. However, it might be argued that complete stability upon the addition of OTUs is an inherently unattainable (and undesirable) goal. The degree of stability will necessarily depend on the position of the new OTU in the established taxonomic structure. Adding another species to an existing species cluster will probably not disturb the taxonomic structure appreciably. Adding OTUs that are intermediate between established taxa may cause far more rearrangements in the classification. Stability of similarity matrices with respect to differences in character coding would be a third aspect of stability of classifications. The consequences of changes in character coding would be assayed in the resulting similarity matrix.
One would like, therefore, to obtain resemblance coefficient formulas that are relatively invariant to differences in character coding. There has been relatively little work on this subject in phenetic taxonomy. Much of the information obtained from the original organism and furnished in the data matrix is coded under conventions that are frequently arbitrary. Similar considerations apply to cladistic work, except that there the comparison might not be between similarity matrices but between parsimony solutions based on different character state codings. If a comparable scale can be found, one might wish to compare the correspondence of similarity matrices resulting from different character codings with the correspondence of cladograms resulting from different codings of the data matrix. These different codings need not necessarily be the same since some of the present phenetic and cladistic techniques have somewhat different requirements.

General Utility

This criterion, albeit hard to define, lies at the base of the phenetic concept of natural classifications (see Sneath and Sokal, 1973:24 for a discussion). Pheneticists argue that classification based on high within-group similarity in as many features as possible, while not perfect for any one purpose, will be widely useful for many scientific purposes since similar OTUs and taxa are placed together. High within-taxon similarity is closely related to character state predictivity, discussed immediately below. In cladistic taxonomy natural taxa are monophyletic groups and high within-group similarity is not a necessary requirement. Yet if a cladistic method were to produce classifications with a higher intrataxon similarity than those produced by phenetic clustering of the same similarity matrix, this would surely be an interesting although unexpected result.

There is general agreement that high predictivity is one of the attributes of a good classification.
Predictivity is defined as some measure of the extent to which an OTU is similar to the other OTUs in the same taxon. For character predictivity to be high, characters must be relatively homogeneous within taxa but differ among taxa. Hence the relation with intrataxon similarity discussed above. Various attempts are currently under way (e.g., Archie, 1980) to establish a generally acceptable measure. Measures of predictivity are X-C comparisons.

The definition of predictivity presents numerical taxonomists with several dilemmas. Presumably one should compute an average predictivity over all characters, but should this be computed separately for each taxon or globally for the entire classification? Also, whereas it is relatively simple to compute such a measure as a homogeneity function over a simple partition of the data (e.g., Gower, 1974), it is less obvious how it should be computed for a hierarchical classification (Archie, 1980). Some balance between classificatory detail and homogeneity of character states should be achieved. In the extreme case of a classification involving the disjoint partition of the entire set of OTUs, perfect homogeneity and prediction of character states is possible but of little taxonomic interest. A measure of predictivity must therefore allow for the several taxonomic levels of a given study, since any one character might be highly predictive at a high categorical level, as for example that corresponding to major groups of the study, but might be of little value in differentiating the small groups into which these major groups are divided in the classification; or the converse relationship may hold.

A second important consideration is whether predictivity should measure only errors of inclusion or both errors of inclusion and those of exclusion. Here the phenetic and cladistic schools diverge clearly.
Pheneticists do not define a natural classification as "that classification whose constituent groups describe the distribution among organisms of as many features as possible" (Farris, 1977:829), nor would they consider a classification to be "natural" just because "each of the characters is represented by a cluster" (Farris, 1979a:201). These may well be cladistic definitions of naturalness but none of the statements defining a Gilmour natural taxon restrict the occurrence of identical character states outside a taxon. Pheneticists have never claimed that each taxon is distinguished by a unique character state or even by predominantly unique character states, by contrast with some cladistic models where such definitions have been used. Citations to the contrary (Farris, 1977, 1979b, citing Sneath, 1961) are misinterpreted. Sneath (1961:122) in discussing the definition of a natural taxon gives the example of a group containing mice, rabbits, and horses, which he states "is not itself a natural taxon. It is only a part of a taxon part of the taxon Mammalia. Quite clearly a natural taxon embraces all of the organisms whose overall similarity comes within certain limits." Nothing is said here, or elsewhere in Sneath's paper, about any one character or set of characters being restricted to a taxon for the classification to be natural. Measures of predictivity can be applied to characters used in establishing the classification as well as to characters not so employed, which are tested against the existing classification. We expect phenetic classification to have the higher average predictivity when tested against the characters on which they are based, but to have lower predictivity when tested against an alternative sample of characters not used to establish the original classification. Cladistic classifications should have lower overall predictivity than phenetic ones because characters exhibiting homoplastic similarity should be incompatible with the former. 
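The idea of a homogeneity function over a simple partition, and the observation that a partition into single OTUs is trivially perfect, can be made concrete with a small sketch. The measure below is a hypothetical illustration (the mean fraction of OTUs that share the modal character state within their taxon); it is not the measure of Gower (1974) or of Archie (1980), and the function name and toy data are assumptions made only for this example.

```python
from collections import Counter

def homogeneity(data, partition):
    """Hypothetical within-taxon homogeneity of character states.

    data: dict mapping OTU name -> list of character states
    partition: list of taxa, each a list of OTU names
    Returns the mean fraction of OTUs that carry the modal state
    of a character within their own taxon (1.0 = perfectly homogeneous).
    """
    n_chars = len(next(iter(data.values())))
    n_otus = sum(len(taxon) for taxon in partition)
    agree = 0
    for taxon in partition:
        for c in range(n_chars):
            states = Counter(data[otu][c] for otu in taxon)
            agree += states.most_common(1)[0][1]  # size of the modal class
    return agree / (n_otus * n_chars)

# Toy data matrix: four OTUs scored for three binary characters.
toy = {"a": [0, 0, 1], "b": [0, 0, 0], "c": [1, 1, 1], "d": [1, 0, 1]}

two_taxa = homogeneity(toy, [["a", "b"], ["c", "d"]])
singletons = homogeneity(toy, [["a"], ["b"], ["c"], ["d"]])
one_taxon = homogeneity(toy, [["a", "b", "c", "d"]])
```

In this toy case the singleton partition scores a perfect 1.0 although it conveys no grouping at all, which is why some balance between classificatory detail and homogeneity is needed.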
To be useful, some taxonomists might require classifications to be equitable in the distribution of taxon sizes, that is, to contain few monotypic and few highly speciose taxa. Such a criterion goes counter to the principles of phenetics or of cladistics since in the former one would prefer to obtain a faithful representation of phenetic structure and in the latter one would prefer an estimate of cladogeny, regardless of whether this results in an imbalanced distribution of OTUs into taxa. The common occurrence of the Willis hollow curve (Sneath and Sokal, 1973:306) is a strong argument against deliberately equalizing taxon sizes. But if such a criterion were desirable, it could be measured (again at various hierarchic levels) in both phenetic and cladistic classifications and these classifications could be compared for this criterion.

Some authors, e.g., Schuh and Polhemus (1980), seem to equate several of the criteria enumerated above. Naturalness, stability, predictivity, and congruence, although not defined, are considered equivalent by these authors, who suggest that maximizing any one of these criteria would maximize the others. We know of no support for this contention at this time.

Fit to a Known Cladistic Relationship

In cladistic taxonomy, if the true cladogram were known, the fit to the true cladogram of a cladogram estimated from a data matrix or a similarity matrix could be used as a measure of quality of the methodology employed. Various methods of comparisons could be invoked. There could be C-C comparisons between the classifications implied by both cladograms, the true and the estimated, or, similarly, T-T comparisons between the tree topologies (with or without allowing for branch length). Although phenetic methods are not intended to describe phylogenies, classifications based on phenograms could also be compared to that based on the true cladogram. Somewhat counterintuitively, one would expect data sets with much homoplasy to yield phenetic classifications that approximate those based on estimated cladograms reasonably well. This is so because the resultant noise in the data would affect both approaches more or less equally.
By contrast, data sets with little homoplasy could yield quite different phenetic and cladistic classifications because of inherent differences in the structure of the approaches. However, one would hope that cladistic methods would provide closer estimates of true evolutionary trees.

Phylogenetic Optimality Criteria

An important class of numerical phylogenetic methods, based on the principle of parsimony, aims at minimum length trees. These differ depending on whether the reversibility of character states is permitted in determining the length of the trees and also on various assumptions about the nature of the trees (e.g., Wagner trees versus Camin-Sokal trees). We must emphasize a distinction which has been rarely made in the literature: the distinction between the goal of a classificatory method and its numerical approximation. Thus the term Wagner tree has been used interchangeably for the goal of the Wagner method, that is, a minimum length Steiner tree, and for the results of the output of various computer algorithms that attempt to obtain such a tree; such results may only approximate the actual minimum length tree for a given data set. As will be seen below, many so-called Wagner trees typically are not minimum length in terms of their own criterion. A similar consideration applies to Camin-Sokal trees. If a Camin-Sokal tree is defined as the minimum length directed tree in a Manhattan metric, then the results obtained by the algorithm of Camin and Sokal (1965) may only be approximations of true Camin-Sokal trees. Ideally one should distinguish the approximation from the goal by a different name, but it is unrealistic to expect an entirely new terminology at this stage in the development of the field. We therefore propose that terms such as Wagner tree, Camin-Sokal tree and the like, be reserved for the goals and that the adjective approximate precede all estimates of such trees obtained as a result of heuristic computer algorithms.
Parenthetically we might add a further note of caution to the uncritical users of numerical phylogenetic methods. It must not only be remembered that the output of a numerical phylogenetic algorithm is an approximation to a goal, but even that solution, when obtainable, would only be an estimate of the true phylogeny of the organisms under study.

Although suggestions have been made from time to time to interpret phenograms in a cladistic manner and thus to measure the degree to which phenograms are minimum length trees, this approach should be futile since Wagner trees are by definition minimum length. It may, however, be of interest to know how well phenograms approximate estimates of minimum length trees.

Compatibility cliques (Estabrook, 1972; LeQuesne, 1972) are an alternative approach. A tree based on a large clique of compatible characters would be preferred to one constructed automatically from incompatible characters. Yet another cladistic approach is the maximum likelihood estimation technique (Farris, 1973; Felsenstein, 1973). Trees with the highest likelihood under a given probability model are preferred. Each of these techniques leads to an optimality criterion.

SOME PUBLISHED COMPARISONS

We now turn to some recently published comparisons involving numerical taxonomic studies and examine the appropriateness of these comparisons in the framework of reference presented earlier.

Tests of Taxonomic Congruence

Mickevich (1978) tested taxonomic congruence between pairs of classifications based on different suites of characters. The classifications were obtained by techniques used by the different taxonomic schools. These studies were carried out for nine taxonomic data sets. For each data set successive character suites belonging to biologically different classes of characters, such as larval versus adult characters, male versus female characters or morphological versus allelic characters were used.
The inappropriateness of the techniques employed by her has already been discussed by us (Rohlf and Sokal, 1980; see also Mickevich, 1980, for a rejoinder). As stressed earlier in this paper, there are several different aspects of taxonomic stability, and Mickevich addressed only one of these: stability on the addition of taxonomic characters. She did this by tests of taxonomic congruence which examined incongruence due to different classes of characters rather than incongruence due to the sampling of characters. Since the nature of the character classes varies among the data sets tested by her, this design confounds error due to the differential involvement of the two classes of characters in the adaptive diversity of the taxon with that due to random sampling error. As we have also pointed out in this paper, if different classes of characters represent different adaptations (as possibly between larvae and adults, or males and females), then reliable phenetic methods should yield different classifications based on these different character classes. By contrast, cladistic classifications based on different classes of characters should be more similar to each other than phenetic ones since there is only a single genealogy for a group of organisms. Whether cladistic classifications will be more stable in practice will depend on the amount of noise in the data and the particular algorithms used. Testing the relative stability of phenetic versus cladistic classifications by a design expected to favor the cladistic method appears to us to be a biased approach. In the first comparative study based on random partitions of character suites, Sokal and Rohlf (1981) found that in one data set, UPGMA, as employed in phenetics, was significantly more stable than the Wagner procedure, employed in cladistics, whereas there was a tendency in the opposite direction in the second data set, which is a subset of the first.
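A design that isolates incongruence due to the sampling of characters would draw many random partitions of the character suite and average the congruence over them. The following is a minimal sketch of such a design, not the procedure of any cited study. The function names are hypothetical, and `congruence` stands for any user-supplied function that classifies the OTUs on each character subset and returns a numerical measure of agreement.

```python
import random

def random_partitions(n_characters, n_replicates, seed=0):
    """Yield random splits of the character indices into two halves."""
    rng = random.Random(seed)  # fixed seed so the replicates are repeatable
    indices = list(range(n_characters))
    half = n_characters // 2
    for _ in range(n_replicates):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[:half]), sorted(shuffled[half:])

def mean_congruence(n_characters, n_replicates, congruence, seed=0):
    """Average a congruence measure over many random character partitions.

    congruence(chars_a, chars_b) is assumed to build one classification
    from each character subset and return their agreement as a number.
    """
    values = [congruence(a, b)
              for a, b in random_partitions(n_characters, n_replicates, seed)]
    return sum(values) / len(values)

# Example: 10 characters, 5 replicate partitions.
splits = list(random_partitions(10, 5))
```

Averaging over replicates gives an estimate of the expected congruence together with its spread, which a single partition cannot provide.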
We are also concerned with examining the nature of the comparison undertaken by Mickevich (1978) and its suitability to the question asked. Tests of congruence could be based upon comparisons of the resulting similarity matrices, trees, dendrograms or of the resultant classifications. Mickevich stated that she chose to test congruence on the resulting classifications, a decision with which we would concur. However, she actually compared unrooted trees rather than classifications in her 1978 study. In her revised study (Mickevich, 1980) she has rooted the trees but also introduced a serious new bias. She chose the roots of the Wagner trees so as to achieve maximal agreement between the pairs of trees for each data set rather than estimate the root independently for each tree. Since she stated that her procedure "... will naturally... produce trees with high consensus information," she is clearly aware of this bias.

Before subjecting each character suite of a data set to a variety of procedures she used a single distance measure, "to keep the analyses as comparable as possible" (Mickevich, 1978:146). This procedure is outlined in the scheme shown in Figure 3. A common method of coding and similarity matrix computation precedes the divergence of the methods of representation. Thus by carrying out the tests in this manner, each method was tested with data sets and algorithms that are not necessarily the most suitable for the representational (S --> T or S --> D --> T) method. In response to our criticism (Rohlf and Sokal, 1980) of her original distance metric, Mickevich (1980) employed taxonomic distances based on standardized characters, but this method (while suitable in some cases) cannot be universally applied to all the data sets analyzed by her. For morphometric data, this procedure is sensitive to size differences, and correlations would have been used by most pheneticists.
For electromorphic and molecular data, the standardization was unnecessary and distorts the data as a function of the distribution of character states encountered. We find it difficult to justify standardization of binary characters that represent states of additive or nonadditive binary coded data.

FIG. 3. [Diagram not reproduced.] Scheme illustrating the test of congruence carried out by Mickevich (1978). The general conventions of this figure are as in previous figures. Letter subscripts refer to distinct character sets. The curved lines at the right of the figure indicate the pair of summarizations for which congruence tests are being performed. For further explanation see text.

It is our contention that the appropriate technique for carrying out such a comparison is to employ suitable methods of coding and of similarity coefficient computation for the given classificatory method, followed by comparison of the resulting classifications. Such a scheme is outlined in Figure 4.

A test of congruence based on such a design has recently been carried out by Schuh and Polhemus (1980). Its scheme is furnished in Figure 5. It is unfortunate that this correct design is flawed by inappropriate and inaccurate computation. Two sets of characters were obtained by a single random partitioning of the entire suite of characters. There should have been an adequate sample of such partitionings in order to compute the average degree of congruence. With only one random partitioning it is impossible to state how representative such a single result is. A further shortcoming of their computations is that the multistate characters are coded improperly for the computation of taxonomic or Manhattan distance coefficients (at least for phenetic studies). A second type of comparison, also shown in Figure 5, is between cophenetic correlations for phenograms and for "cladograms" following the method of Farris (1979b).
Each of these is indicated by curved lines in Figure 5. The inappropriateness of such a comparison is explained in the next section. Thus the validity of the results of Schuh and Polhemus (1980) is suspect, despite the generally correct design (Sokal and Rohlf, 1981). Our major criticisms are not addressed in a rejoinder by Schuh and Farris (1981).

FIG. 4. [Diagram not reproduced.] A recommended procedure for congruence. This scheme illustrates our recommendations for the correct way of carrying out a test of congruence. Letter subscripts refer to distinct character sets. The curved lines at the right of the figure indicate the pair of summarizations for which congruence tests are being performed. To simplify the diagram, D1A has been omitted from the path S1A --> C1A and similarly for D2A. S2A or T2A are alternative results of the taxonomic procedure. T2A could also be inserted as an additional step as follows: --> S2A --> T2A --> C2A, and similarly for B. For further explanation see text.

FIG. 5. [Diagram not reproduced.] Scheme illustrating the test of congruence carried out by Schuh and Polhemus (1980). The symbolism is as for earlier figures. C1A and C2A are alternative results of the taxonomic procedure. They are compared using the number of common "informative" components (sensu Nelson, 1979). In addition to comparing the summarizations at the right of the figure (curved lines), comparison was also made between the cophenetic correlations S1A-C1A and S1A-T2A. The latter comparisons are similar to those of Figure 6E.

Tests of Naturalness

Several comparisons of the relative merits of phenetic versus cladistic classifications were carried out by Farris (1977, 1979a, b). The design of the 1977 study is illustrated in Figure 6A (note that here and in subsequent diagrams of Figure 6 we omit D from the sequence S --> D --> C to simplify the diagrams and discussion). The study compares how well a classification C1, obtained by an S1 --> C1 operation, recovers S1, a phenetic similarity matrix, with how well a second hierarchy C2 resembles a "special" similarity matrix S2 from which it was obtained. Both similarity matrices are based on the same artificially constructed data matrix X. If the cophenetic correlation for the S2-C2 comparison is greater than that for the S1-C1 comparison, it demonstrates that by this measure, the matrix S2 is more accurately represented by a hierarchy (ultrametric); we may say it is inherently more hierarchic than matrix S1. The relevance of this to phenetic taxonomy is not clear. If in a given study S1 is considered to correspond to phenetic similarity then one of the attributes of a good phenetic classification would be that similar OTUs (as defined by S1) are placed in the same taxa. If the correlation for the S1-C2 comparison is low, then the magnitude of the S2-C2 comparison is not of much interest in phenetic taxonomy. One can always achieve a perfect correlation for the S2-C2 comparison if one is not concerned with the strength of the S1-S2 or S1-C2 comparisons (this can be done by simply defining S2 to be a matrix with perfect ultrametric structure with no necessary relationship to any data; Janowitz, 1979). If, by contrast, the goals of the study are

to form taxa in which similar OTUs (defined as in S1) are placed together, then one would not want to distort the overall similarity relationships. Showing that the cophenetic correlation coefficient for the S2-C2 comparison is higher than that for the S1-C1 comparison simply shows that S2 is more compatible with a hierarchy than is S1. It does not of itself imply that C2 is more desirable than C1.

FIG. 6. [Diagrams not reproduced.] Schemes of several comparisons between phenetic techniques and various methods proposed by J. S. Farris. The symbolism is that of the earlier figures. In diagrams A, B, D, and F the dendrogram step D has been omitted from the paths S --> C for simplification.
A. Design of the comparison reported by Farris (1977); X is an artificial data matrix. The comparison was made between cophenetic correlations S1C1 and S2C2.
B. A second design reported by Farris (1977). S1 is a dissimilarity matrix on real data, S2 is a similarity matrix obtained by a special transformation of S1. The comparison made is between the cophenetic correlations S1C1 and S2C2.
C. Scheme of the same design illustrated in B, but employing the symbolism of Farris (1979a) which is shown in boldface in the text.
D. Design reported by Farris (1979a). This is similar to the design in B, except that unrooted trees T1 and T2 are obtained from cophenetic matrices C1 and C2, respectively. The comparison is made between the matrix correlation S1T2 and cophenetic correlation S1C1 and also between the matrix correlation S2T1 and the cophenetic correlation S2C2.
E. Scheme of the same design illustrated in D, but employing the symbolism of Farris (1979a) which is shown in boldface in the text and here.
F. Design of comparison employed by Farris (1979b). The comparison is being made between ST matrix correlations and SC cophenetic correlations. The tree structure T is an unrooted tree obtained by the distance Wagner procedure modified so as to optimize the ST correlation.

Even if the comparisons were meaningful, the results obtained by Farris (1977) with the artificial data set depend on the exact replication of each character employed by him. Even the slightest departure from that replication scheme yields results in which the two types of similarity show little difference (P. H. A. Sneath, D. H. Colless; personal communication). Although Farris (1977) also reported results for real data, it was difficult to evaluate these since no information was given about the nature or source of the fifty sets of data reported in that study. However, Table 2 of Farris (1979a), which lists results on ten data sets, does allow one to repeat his computations, to clarify precisely what comparisons were being made, and to try other methods on these same data sets.

The design of the study of real data employed by Farris (1977) is the same as the first of the two approaches discussed in Farris (1979a). Both are described here in the terminology of his 1979a paper, to provide continuity for the reader, except that we have changed his symbols to boldface (since they represent matrices), to avoid confusion with our symbols for classificatory steps, especially between his D (a dissimilarity matrix) and our D (dendrogram). In Farris (1979a), the special similarity matrix, A, is computed by a transformation of the original dissimilarity matrix D, rather than directly from the original data matrix as in Farris (1977). Farris observed that the correlation between a matrix of "special" similarities, A, and a matrix of cophenetic values A' obtained from a UPGMA clustering of A, is higher than the correlation between the original dissimilarity matrix, D, and the matrix of cophenetic values D' obtained from a UPGMA clustering of D. Since the various matrices are symmetric, by convention only the elements in a half matrix excluding diagonals are used to compute these matrix correlations. These results imply that the special similarity matrix is closer to an ultrametric than the original dissimilarity matrix. From these findings Farris concluded that phylogenetic methods were superior to phenetic methods by the criteria of the latter. Note that D and A correspond in our notation to S1 and S2, and that D' and A' correspond to C1 and C2, respectively (see Figures 6B and 6C).
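The cophenetic correlation DD' described above can be sketched in a few lines. What follows is a minimal pure-Python illustration, not the program used in any of the studies discussed: it clusters a small dissimilarity matrix by UPGMA, records the cophenetic values, and correlates the two half matrices with diagonals excluded. The function names and the 4-OTU matrix are assumptions made for the example, and ties among minimum distances are broken arbitrarily.

```python
from itertools import combinations
from math import sqrt

def upgma_cophenetic(d):
    """Cophenetic matrix from UPGMA clustering of dissimilarity matrix d."""
    n = len(d)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member OTUs
    dist = {(i, j): d[i][j] for i, j in combinations(range(n), 2)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        (a, b), height = min(dist.items(), key=lambda kv: kv[1])
        # Every pair split between the two merged clusters joins at this level.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        new_dist = {}
        for c in clusters:
            dac = dist[(min(a, c), max(a, c))]
            dbc = dist[(min(b, c), max(b, c))]
            # Size-weighted mean equals the unweighted average over all pairs.
            new_dist[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    return coph

def matrix_correlation(p, q):
    """Pearson correlation over the half matrices, diagonals excluded."""
    pairs = list(combinations(range(len(p)), 2))
    xs = [p[i][j] for i, j in pairs]
    ys = [q[i][j] for i, j in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical dissimilarities among four OTUs (not ultrametric).
D = [[0, 2, 5, 9],
     [2, 0, 7, 9],
     [5, 7, 0, 8],
     [9, 9, 8, 0]]
D_prime = upgma_cophenetic(D)
r_DD_prime = matrix_correlation(D, D_prime)
```

A matrix that is already ultrametric is reproduced exactly by its own cophenetic values, so its cophenetic correlation is 1; the further a matrix departs from ultrametric structure, the lower the value tends to be.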
There are formal difficulties in comparing S1-C1 with S2-C2 measures, similar to the arguments presented earlier (see also Janowitz, 1979), but it is instructive to see what such a comparison means in empirical terms. The D, D', A and A' matrices for the Xenopus data set (Farris, 1979a; Bisbee et al., 1977) are shown in Table 1. In this example, OTU 1 has been used arbitrarily as the reference OTU for computing A and A', as was done by Farris (1979a). Accordingly we label these matrices A1 and A'1, respectively, to indicate the OTU employed as the reference point. This choice results in the entire first row and column of the special similarity matrix A1 being zeroed. This pattern guarantees that row 1 and column 1 of the ultrametric matrix A'1 resulting from a UPGMA clustering of A1 will also be all zeros. Thus the correlation between A and A' would tend to be positive even if there were no other relationships between the values of the A and A' matrices (the largest elements of these matrices, a_ij and a'_ij, will also be identical when they correspond to mutually closest pairs of objects). Figure 7 shows a plot of the elements of A1 against those of A'1 for the present example. The points occur in three groups. Those involving reference OTU 1 are at the origin, those corresponding to mutually closest pairs lie along the straight line a_ij = a'_ij, whereas the rest of the points range over the scattergram. It would seem reasonable to remove the reference OTU from the comparisons, since once it is selected, its values against all other OTUs are fixed at zero and thus do not depend upon the observed data. If one computes the correlation between matrices A and A' with row 1 and column 1 ignored (we shall designate such submatrices as A(1) and A'(1), respectively), then the cophenetic correlation AA' will be lower (it drops from to in this example).

A second approach in Farris (1979a) leads to a new empirical result.
The special similarity matrix A1 is again obtained by transformation of the original phenetic dissimilarity matrix D. Each matrix is clustered by UPGMA and matrices of cophenetic values D' and A', respectively, are obtained. These cophenetic matrices are then transformed back into estimates of matrices of the other kind, D' into A", and A' into D", where the double primes denote estimates of the indicated matrix.
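The transformation itself is not written out in this section. The sketch below assumes the commonly cited form of the special similarity transformation with reference OTU r, a_ij = (d_ir + d_jr - d_ij)/2, and its inverse, d_ij = d_ir + d_jr - 2a_ij. Both the formula and the function names are assumptions made for illustration and should be checked against Farris (1977, 1979a) before being relied upon.

```python
def special_similarity(d, r):
    """Assumed form of the transformation: a_ij = (d_ir + d_jr - d_ij) / 2."""
    n = len(d)
    return [[(d[i][r] + d[j][r] - d[i][j]) / 2 for j in range(n)]
            for i in range(n)]

def back_transform(a, d, r):
    """Assumed inverse: d_ij = d_ir + d_jr - 2 a_ij (diagonal set to zero).

    Only the distances to the reference OTU are taken from d; all other
    entries are rebuilt from the similarity matrix a.
    """
    n = len(d)
    return [[0.0 if i == j else d[i][r] + d[j][r] - 2 * a[i][j]
             for j in range(n)] for i in range(n)]

# Hypothetical dissimilarities among four OTUs; OTU 0 is the reference.
D = [[0, 2, 5, 9],
     [2, 0, 7, 9],
     [5, 7, 0, 8],
     [9, 9, 8, 0]]
A = special_similarity(D, 0)

# Stand-in for a clustered matrix A': one off-diagonal entry is altered,
# while the reference row and column stay at zero.
A_alt = [row[:] for row in A]
A_alt[1][2] = A_alt[2][1] = 0.5
D_est = back_transform(A_alt, D, 0)
```

Under this assumed form, the row and column of the reference OTU in A are zero whatever the data, and the back-transformed matrix reproduces the original distances to the reference OTU exactly, however the remaining entries have been altered.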

TABLE 1. D, D', A1, A'1, AND D"1 MATRICES FOR THE XENOPUS DATA SET. [Matrix entries not recoverable.]

Notes: D"1 values in italics are identical to corresponding elements of D. The diagonal entries in the A'1 matrix are given as X; their actual values are positive, equal to or larger than the largest element in the matrix, and equal to each other. These requirements are necessary for this similarity matrix to obey the set of conditions equivalent to the ultrametric for dissimilarity matrices (see Appendix A). For explanation of other symbols see text.

FIG. 7. [Plot not reproduced.] Plot of elements of A1 versus those of A'1 for the Xenopus data. OTU 1 was used as the reference OTU. Dashed line indicates locus of elements for which a_ij = a'_ij. The circle at the origin indicates five separate points, all at the origin.

The transformation A' --> D" turns the clustered special similarity matrix into an unrooted tree, with the elements of D" giving the path length distances of such a tree. Thus the entire procedure D --> A --> A' --> D" corresponds to S1 --> S2 --> D --> C --> T in our notation (actually in this case T is a function of both C and the diagonals of S2). Farris then computes matrix correlations (incorrectly called cophenetic correlations) DD" and AA" and compares these to the cophenetic correlations DD' and AA', respectively. He finds correlations DD" > DD' and AA" < AA' and argues that since matrix correlations based on "special" similarity coefficients (AA') or those derived from these coefficients (DD") are greater than those based on or derived from "phenetic" matrices (AA" and DD'), phylogenetic techniques are superior to phenetic methods. That is, he claims that since correlations for the S-T comparisons are higher than those for the S-C comparisons, the phylogenetic techniques are superior to phenetic ones. The new design is illustrated in Figures 6D and 6E.
In Figure 6D we have used the notation introduced by us in this paper. Figure 6E employs the notation of Farris (1979a), and we note that S1 = D, C1 = D', S2 = A, C2 = A', T1 = A", and T2 = D". Contrasting the matrix correlation for the comparison S1-T2 to the cophenetic correlation for the S1-C1 comparison again is not appropriate when one is interested in comparing classifications. The cophenetic correlation for the S1-C1 comparison tests how faithfully the ultrametric classificatory relationships C1 fit the original similarity matrix S1. But matrices for T2 (obtained via the special similarity transformation S2 of S1 and the cophenetic values C2 of S2) are not ultrametrics (see Appendix A) and therefore are not equivalent to classifications. A similar argument obtains when one contrasts the S2-T1 with the S2-C2 comparisons. We find it difficult to perceive how such a contrast could demonstrate the advantage of one method or school of taxonomy over the other.

But even though the comparisons do not speak to the cladistics-phenetics controversy, it is of considerable interest to discover why the phylogenetic methods should yield higher matrix correlations with the original phenetic dissimilarity matrix than the phenetic techniques employed.

There is again a problem in assessing the amount of agreement between matrices D and D". Although the elements in row 1 and in column 1 of the D" matrix are not fixed at zero (as was the case for the A and A' matrices), the elements will always be identical to the corresponding elements of the D matrix. The distances of the reference OTU to the others will be perfectly encoded and then recovered by the special similarity transformation and its inverse. This is shown in Table 1 for D"1. Thus, again, it does not seem appropriate to include the values involving the reference point in the comparison. If row 1 and column 1 of the D and D" matrices are ignored, then the correlation between the reduced matrices, designated as D(1) and D"(1), will necessarily be lowered (although in this example it drops only slightly, from to ). Since we recommend comparing correlations between the various derived matrices with the reference OTU i left out, we exclude the latter from the computation of the cophenetic correlation between the input dissimilarities and their UPGMA cophenetic values which are to be used as a standard. We have identified such correlations as DD'(i).
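The effect of leaving the reference OTU out of a matrix correlation can be illustrated with a small sketch. The two matrices below are invented for the purpose: they agree exactly in the row and column of OTU 0 (the assumed reference) and are unrelated elsewhere. The function name and its `exclude` parameter are hypothetical.

```python
from itertools import combinations
from math import sqrt

def matrix_correlation(p, q, exclude=None):
    """Pearson correlation over half matrices, diagonals excluded.

    If exclude is given, every pair involving that OTU (for example the
    reference OTU) is left out of the computation.
    """
    pairs = [(i, j) for i, j in combinations(range(len(p)), 2)
             if exclude not in (i, j)]
    xs = [p[i][j] for i, j in pairs]
    ys = [q[i][j] for i, j in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented matrices: identical in row/column 0, unrelated in the rest.
P = [[0, 2, 5, 9],
     [2, 0, 7, 9],
     [5, 7, 0, 8],
     [9, 9, 8, 0]]
Q = [[0, 2, 5, 9],
     [2, 0, 8, 8],
     [5, 8, 0, 6],
     [9, 8, 6, 0]]

r_full = matrix_correlation(P, Q)                # reference OTU included
r_reduced = matrix_correlation(P, Q, exclude=0)  # reference OTU left out
```

In this constructed case the full correlation is high solely because of the shared reference row, while the reduced correlation is zero. Real data will rarely be so extreme, but the direction of the bias is the same.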
Farris (1977 and 1979a) arbitrarily used only the first OTU of each data set as the reference OTU. Farris (1979a) states: "In each case the reference point r was taken simply as the first terminal taxon of the data set. Other reference points might of course be used, and might yield somewhat different cophenetic correlations." Since the choice of OTU 1 was arbitrary, we have examined the consequences of trying each OTU in turn as the reference point. The relative magnitudes of the various AA'(i) and DD'(i) correlations differ considerably from the DD'i and AA'i correlations reported by Farris. Incidentally, note that in our tables (and in Appendix B) we report matrix correlation coefficients, as is conventional. Squared correlation coefficients (used by Farris, 1979a, b) tend to emphasize the (small) differences between correlations for DD'' and DD', and for AA'' and AA'.

In Table 2, we have shown the various possible correlations for the Xenopus data set with each OTU serving in turn as the reference OTU. Only when OTU 6 is used as a reference are the correlations AA'i > DD'. The more proper comparison, with the reference OTU excluded from the computations, results in no case where correlations AA'(i) > DD'(i). These results lead to conclusions quite different from those of Farris (1977:838), who concluded that clustering by special similarity is generally superior to clustering by overall similarity (he reported that he had found no cases in which AA'i < DD' in the 50 data sets he examined). Although correlations DD''i are greater than DD' (as might be expected, given their common reference column and row), when we exclude the reference OTU we find that in only three cases is correlation DD''(i) > DD'(i). To investigate further the properties of the special similarity transformation, we also applied it to the artificial data set of Table 5.1 in Sneath and Sokal (1973).
These data consist of 16 points plotted haphazardly onto a surface, and have no biological or evolutionary meaning. We found in this case the DD''(i) correlations to be considerably higher than the DD'(i) correlations, suggesting that the effectiveness of the special similarity transformation is a mathematical property of the transformation and is not based on any "naturalness" of the data in an evolutionary or phylogenetic sense.

TABLE 2. MATRIX CORRELATIONS FOR VARIOUS MATRICES DEFINED IN THE TEXT COMPUTED FOR THE XENOPUS DATA. OVERALL COPHENETIC CORRELATION DD' = 0.9927.
[Table body not recoverable; the values of D(B2) and DM(2D) are also missing.]
Note: The F-column gives the results of F-tests of the null hypothesis: correlation DD''(i) = DD'(i). D(B2) and DM(2D) are matrix correlations of the original dissimilarity matrix to Bk-clustering (k = 2) and nonmetric multidimensional scaling in two dimensions, respectively. For explanation of other symbols see text.

Even when the distances involving the reference OTU are omitted from the computation of the matrix correlation, we would still expect D'' to fit the original D matrix somewhat better than does D'. However, this is strictly due to a mathematical property of the transformations employed. When constructing the special similarity matrix A' and from it D'', one estimates more parameters from the data than when one constructs the D' matrix of cophenetic values. An analogy from statistics may be helpful here. In multiple regression analysis one expects the addition of more independent variables (and thus more parameters, the partial regression coefficients) to increase the squared multiple correlation coefficient R^2. Similarly, one expects the correlation between D and D'' to be greater than that between D and D', since more parameters from D are employed in the computation of D'' than of D'. On transforming D to A (for t OTUs), the parameters being estimated are the t - 1 distances between the reference OTU and the other OTUs. In addition, the cluster analysis of the A matrix (to produce A') estimates t - 1 clustering levels from that matrix, for a total of 2t - 2 parameters.
However, the level at which the reference OTU clusters with the others in the cluster analysis of the A matrix is defined to be the average of the elements in the first row or column (zero), so that there is actually one fewer parameter. Thus, there will be 2t - 3 parameters to be estimated from the t(t - 1)/2 dissimilarity values in matrix D. (For t = 3 these two quantities are equal, which agrees with the fact that A = A' and D = D'' when there are only three OTUs.) Since the cophenetic matrix D' is fitted from only t - 1 parameters of matrix D, one would on the average expect the correlation DD'' to be greater than that for DD'. For the same reasons, we expect correlations AA' > DD', on the average. Examining these relations once more in Table 2, we note that the correlation DD''(i) is greater than that for DD'(i) in three of six cases, equal in one, and less in two. In no case are these differences substantial. However, against expectation, the correlation AA'(i) is less than that for DD'(i) for all six reference OTUs.

TABLE 3. INEQUALITIES OF MATRIX AND COPHENETIC CORRELATIONS IN ELEVEN DATA SETS, TEN REAL AND ONE ARTIFICIAL.
Columns: Data set (source); No. OTUs; No. of correlations showing each inequality; No. of significant differences. [Numerical entries not recoverable.]
Xenopus (Bisbee et al., 1977)
Rana (Wallace et al., 1973)
Pines (Prager et al., 1976a)
Hyla (Maxson and Wilson, 1975)
Bird transferrins (Ho et al., 1976; Prager et al., 1976b)
Bird lysozymes (Jolles et al., 1976)
Drosophila (Lakovaara et al., 1972)
Primates (Benveniste et al., 1972)
Rana (Case, 1978) (1)*
Seven carnivores (Sarich, 1969a, b; Farris, 1972)
Average proportion of OTUs showing the inequality
Table 5.1 (Sneath and Sokal, 1973) (15)*
* Results shown for comparison. Actual significance test used is approximate.
Significance tests performed at 5% level.
Note: For explanation of symbols see text.

Can one develop an approximate significance test for the difference between DD''(i) and DD'(i)? In multiple regression analysis, the empirical decision to employ a model which incorporates additional parameters is justified if the increase in prediction (increase in R^2) resulting from use of the more complicated model is larger than one would expect just due to chance. If we could assume that the t(t - 1)/2 input distance values were independent and normally distributed, we could perform a least squares test of significance. This assumption is perhaps not unreasonable for these biochemical and immunological data sets, in which each distance is determined experimentally; the only exception is the data on Rana from Case (1978), which are based on a data matrix. The approximate F-statistic is a ratio of two mean squares, one based on the residuals (D - D'') from the more complete model and the other based on the change in the variance of the residuals that one observes in going from the simpler model (D') to the more complex model (D''). Specifically:

$$F_s = \frac{\left(r_{DD''}^{2} - r_{DD'}^{2}\right) / \left(t' - 1\right)}{\left(1 - r_{DD''}^{2}\right) / \left[t'(t' - 1)/2 - 2t' + 3\right]}$$

where t' = t - 1, an adjustment for the fact that we are computing the correlations with the reference OTU ignored, thus reducing the size of the matrix by unity (Carroll and Chang, 1973, proposed the same statistic). The results for Xenopus are shown in the last column of Table 2. None of the DD''(i) correlations are significantly greater than those for DD'(i).
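The approximate F-statistic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the degrees of freedom are taken to be t' - 1 for the numerator and t'(t' - 1)/2 - 2t' + 3 for the denominator, with t' = t - 1, and the function name and the correlation values in the example are hypothetical rather than taken from any table in this paper.

```python
def approx_f_statistic(r_dd2, r_dd1, t):
    """Approximate F for the gain in fit of D'' over D'.

    r_dd2 -- matrix correlation DD''(i), reference OTU excluded
    r_dd1 -- cophenetic correlation DD'(i), reference OTU excluded
    t     -- number of OTUs in the full matrix
    Returns (F, numerator df, denominator df).
    """
    t_red = t - 1                                   # t' = t - 1
    df_num = t_red - 1                              # extra parameters in D''
    df_den = t_red * (t_red - 1) // 2 - 2 * t_red + 3   # residual df
    if df_num <= 0 or df_den <= 0:
        raise ValueError("too few OTUs for the test")
    gain = (r_dd2 ** 2 - r_dd1 ** 2) / df_num
    residual = (1 - r_dd2 ** 2) / df_den
    return gain / residual, df_num, df_den

# Hypothetical correlations for a study with eight OTUs.
F, df1, df2 = approx_f_statistic(0.99, 0.98, t=8)
print(round(F, 3), df1, df2)
```

The resulting value would be compared with the critical F for the two degrees of freedom at the chosen level; as noted in the text, the test is only approximate and presupposes independent, normally distributed distances.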
The computations and comparisons discussed above were carried out for all of the data sets investigated by Farris (1979a). The results are furnished in Appendix B and summarized in Table 3. There is considerable variability in the results depending upon the data set. The inequality of correlations DD''(i) ≥ DD'(i), expected on mathematical grounds, is found in [value missing] of the comparisons on the average; the inequality of correlations AA'(i) ≥ DD'(i) in [value missing] of the comparisons. Only four of the ten data sets show any significantly higher correlation for DD''(i) than for DD'(i), and of these only the Hyla and the Rana (Wallace et al., 1973) data sets have significantly higher correlations for more than half of the reference OTUs. Interestingly, the artificial, nonbiological data set (Table 5.1, Sneath and Sokal, 1973) shows the most pronounced tendency toward a better fit by special similarity, the phylogenetic methodology. However, the F-test is not appropriate for this data set since the distances are not independent.

We have concluded earlier that the comparisons as carried out by Farris (1977, 1979a) do not address a meaningful question in taxonomy. Indeed, the comparison of DD'' with DD' is inappropriate if for no other reason than the following. Whereas the cophenetic matrix D' is an ultrametric and hence corresponds to a classification, D'' is not ultrametric (see Appendix A) and thus does not correspond to a classification. Such a comparison is between an S-T measure and an S-C measure. If we were prepared to accept non-Linnean classifications (see DuPraw, 1964, 1965), i.e., nonultrametric solutions, we could have found equally good or better fits to D than that of D'' by Bk-clustering (Jardine and Sibson, 1971) or by nonmetric multidimensional scaling. In the analyses reported in Table 2 and in Appendix B we have computed these solutions using Jardine and Sibson's overlap parameter k = 2 and 2-dimensional scaling. For the Xenopus data, these two methods yield matrix correlations of [value missing] and [value missing], respectively, whereas the highest correlation DD''(i) equals [value missing]. But for other data sets these parameters yield matrix correlations lower than DD''. In such cases our solutions could easily have been improved by increasing k and raising the dimensionality of the scaling procedure. Some pheneticists, ourselves included, are prepared to consider nonultrametric solutions; but this approach violates basic rules of Linnean taxonomy as well as of cladistic taxonomy and is rejected firmly by cladists, e.g., Eldredge and Cracraft (1980) and Farris (1977). But even if one accepts his comparisons as being meaningful, our empirical findings do not lead to the conclusions reached by Farris (1979a), who limited his analyses to using OTU 1 as a reference point.
When all OTUs are used as reference points in turn, and when the comparison is made more equitable by eliminating the reference OTU from the two matrices, there is only a slight and nonsignificant trend toward a better fit for special similarity matrices. Claims by Farris (1979a) that "The hypotheses of stochastic equality between DD' and DD'' are comfortably rejected at the conventional 5 percent two-tailed error rate by a Sign Test..." and "The observation that DD'' consistently exceeds DD' allows rejection of the phenetic hypothesis with a degree of confidence very near indeed to absolute certainty" are thus not supported on closer scrutiny. In fact, readers can easily convince themselves that, had Farris used OTUs 2, 3, or 5 as arbitrary reference points rather than OTU 1, up to four of the ten data sets would have yielded results opposed to those he reported for the DD''i versus DD' comparison (of course even more data sets show opposing results when one eliminates the reference OTU from the matrices being compared).

Farris (1979b) introduced an additional comparison. In that study he compared the cophenetic correlation between a distance matrix D and its UPGMA cophenetic matrix D' with the matrix correlation between the same distance matrix and the distances implied by an unrooted Wagner tree. We represent this comparison schematically in Figure 6F. In this comparison the distance Wagner procedure (Farris, 1972) was modified to optimize the matrix correlation between the path-length distances of pairs of OTUs along branches of the tree and the original dissimilarity values (thus it is a least squares best fitting tree). His Table 1 appears to show that a phylogenetic method was superior in all cases to a phenetic method. Once more the comparison is problematic. It is not appropriate to compare an S-C measure with an S-T measure when one wishes to determine "a classification with maximal content of distance information" [emphasis ours] (Farris, 1979b:495).
The unrooted modified Wagner tree is simply not a classification. Also, since the least squares tree does not seem to correspond to the concepts of parsimony, compatibility, or maximum likelihood, one may also question whether it is a phylogenetic method. The correlations obtained by Farris are reproduced for convenience in the column labeled DDW in our Table 4. Since more parameters are being fitted, one can again ask whether or not the additional parameters employed by Farris are useful, i.e., whether they explain a significant added proportion of the variance. F-statistics computed as before are furnished, as are degrees of freedom for each of these data sets. The increase in correlation was significant at the 5 percent level in seven of the ten data sets examined.

The column labeled DDLS in Table 4 gives the matrix correlation between D and the least squares best fitting unrooted tree to the distance matrix. It is interesting that in most cases our correlations are larger than those reported by Farris (1979b). Thus the evidence that unrooted trees can yield better fits than ultrametrics to a distance matrix is actually stronger than he reported. Farris' modified Wagner procedure apparently did not obtain the optimal trees. Because of this discrepancy we also investigated the matrix correlations for the minimum length trees (in column DDML) as well as for minimum length tree topologies in which we adjusted the branch lengths so as to maximize the correlation for that given tree topology (in column DDMLA). The UPGMA phenogram corresponding to the cophenetic values matrix D', the tree corresponding to D'', the minimum length tree DDML, and the least squares best fitting tree DDLS are given in Figures 8A-8D for the Xenopus data.

Interesting as these results are, they are not really relevant to the decision as to which technique is best able to yield a classification which preserves information about the original dissimilarities. To answer this question we converted the unrooted trees into classifications expressed as ultrametric matrices (see Appendix A). Since any root could have been chosen, we selected that rooting and that set of levels which maximized the cophenetic correlation for the given tree topology.
It should be emphasized that the rooting procedure results in cophenetic correlation coefficients which will be higher than those which one would usually obtain if the trees were rooted using other criteria (such as by the use of an outgroup or by the use of the midpoint criterion). Our rooting procedure is thus biased in favor of the phylogenetic methods. Figures 8E and 8F show examples for the Xenopus data set. The cophenetic correlations for rooted least squares fitting trees and minimum length trees for all data sets are also furnished in Table 4 in columns DCLS and DCML, respectively. As can be seen in the column labeled "Best classification," there is only one case in which the UPGMA-based classification does not have the highest correlation among the three comparisons (DD', DCLS, and DCML). Thus, these data sets do not support the contention that phylogenetic methods generally produce classifications which have higher correlations with the original dissimilarity values.

[TABLE 4. Table body not recoverable. The surviving part of its note states that for some data sets a trial-and-error procedure was used to minimize tree length or maximize least squares fit, and that for the other data sets the solutions were obtained by enumeration.]

A Test of Predictivity

Archie (1980) has investigated the predictive value of various classification-forming procedures for binary coded characters. For each of nineteen different data sets he employed twenty different classificatory procedures used in cladistics and phenetics. The procedures involve different methods for defining similarity as well as both phenetic clustering and cladistic tree forming methods. His procedure is diagrammed in Figure 9. S1 and S2 correspond to different similarity measures (simple matching versus Jaccard coefficients) and the Cij's refer to the jth classificatory procedure applied to the ith similarity measure (via dendrogram Dij, which is omitted from the diagram to simplify it). Since predictivity relates to the ability to predict character states for an OTU, the comparisons made by Archie (1980) are, quite correctly, all of the type X-C.
Unfortunately his results are rather complex in that the decision as to which method is "best" depends upon how one measures predictivity (six different methods were studied which considered either errors of inclusion only or both errors of inclusion and exclusion). When one selects in advance which of the two binary states is to be predicted, then WISS trees (Farris et al., 1970), which are approximate solutions for Camin-Sokal trees (Camin and Sokal, 1965), performed best (the predicted state is considered to be the derived state). When one predicts whichever state of each character can be predicted best for a given data set, then no one method is significantly better than the others. Other results were less clear except for the fact that single-linkage clustering always performed rather poorly.

CONCLUSIONS

Recent appeals to taxonomists of various philosophies to define the goals of their schools in a manner so that they can be objectively tested are indeed welcome. The goals of phenetic taxonomy have been stated in relatively loose terms which need to be refined. The phylogenetic approach has an unequivocal goal, the description of the true phylogeny (or at least the true cladogeny), but since this is generally unknown it is difficult to measure correspondence between it and estimates produced by a phylogenetic method. These estimates may differ from the true phylogeny because of insufficient information, incorrect assumptions made in the methods, or inadequate algorithms. A number of techniques are reviewed that employ one or more criteria of desirability. Within any one class of techniques comparisons are readily made. For example, the question of whether ultrametric C1 is a better representation than ultrametric C2 of phenetic matrix S is a meaningful question and can be addressed operationally.
But whenever we ask whether cladistic classifications are more congruent than phenetic ones, or more predictive, or better fitting to the original data, we run into difficulties because the definitions of these terms are different in different taxonomic schools. For example, congruence in phenograms and cladograms need not be measured in the same way. Thus while we are sanguine about the development of further objective tests to demonstrate improvements in specific phenetic and cladistic techniques, the number of comparisons that can be made between methods from different schools is limited. Summarizing the conclusions of an earlier section, we would state that only in three areas of current interest, namely taxonomic congruence (addition of characters to a classification), addition of OTUs to a classification, and character state predictivity, can valid comparisons between phenetic and cladistic methods be carried out. Attempts to date to make such comparisons, by Mickevich (1978, 1980), Farris (1977, 1979a, b), Schuh and Polhemus (1980), and Schuh and Farris (1981), were inappropriately carried out and hence do not contribute to a resolution of the issues in contention. The study by Archie (1980) has employed correct and meaningful comparisons. However, this latter study, limited to binary characters, yielded complex results and does not uniformly support any one school of taxonomy.

FIG. 9. Scheme of test of predictivity employed by Archie (1980). The comparisons are all between X-C measures. Intermediate steps D and T have been omitted to simplify the diagram. For further explanation see text.

ACKNOWLEDGMENTS

This paper represents contribution No. 262 from the Program in Ecology and Evolution at the State University of New York at Stony Brook. It was supported in part by grants DEB (FJR) and DEB (RRS) from the National Science Foundation. We are indebted to Donald H. Colless, George F. Estabrook, M. F. Janowitz, Bruce Riska and Peter H. A. Sneath and to an anonymous reviewer for valuable comments on earlier drafts of this manuscript.
Computer time was made available at the Computation Center of the State University of New York at Stony Brook, at the Computer Facilities of The Division of Biological Sciences, State University of New York at Stony Brook, and at the IBM T. J. Watson Research Center at Yorktown Heights, New York. Joyce Schirmer prepared the illustrations, Barbara McKay typed the initial draft of the manuscript, and Jo Genzano typed the final research report.

FIG. 8. Summary representations of the Xenopus data. A. UPGMA phenogram of the distance matrix. The cophenetic values implied by this phenogram are the D' values discussed in this paper. The ordinate is in the original dissimilarity scale of the Xenopus study. B. Tree corresponding to the D'' matrix of the Xenopus data. OTU 1 has been chosen as the reference OTU. The numbers next to the internodes give branch lengths. C. Minimum-length tree for the Xenopus data. Numbers next to internodes give branch lengths. D. Least squares fitting tree for the Xenopus data. Numbers next to internodes give branch lengths. E. Best classification based on the minimum-length tree topology (in dendrogram C). F. Best classification based on the least squares fitting tree topology (in dendrogram D). The ordinates in dendrograms E and F are in the original dissimilarity scale.

REFERENCES

ARCHIE, J. Definition, criteria and testing of the predictive value of classifications. Ph.D. dissertation. State University of New York at Stony Brook.
BENVENISTE, R. E., AND G. J. TODARO. Evolution of type C viral genes: evidence for an Asian origin of man. Nature, 261:
BISBEE, C. A., M. A. BAKER, AND A. C. WILSON. Albumin phylogeny for clawed frogs (Xenopus). Science, 195:
BUNEMAN, P. The recovery of trees from measures of dissimilarity. Pp , in Mathematics in the archaeological and historical sciences (F. R. Hodson, D. G. Kendall, and P. Tautu, eds.), Edinburgh: Edinburgh Univ. Press, 565 pp.
BUNEMAN, P. A note on the metric properties of trees. J. of Combinatorial Theory, 17:
BUSACKER, R. G., AND T. L. SAATY. Finite graphs and networks. McGraw-Hill: New York, 294 pp.
CAMIN, J. H., AND R. R. SOKAL. 1965. A method for deducing branching sequences in phylogeny. Evolution, 19:311-326.
CARROLL, J. D., AND J-J. CHANG. A method for fitting a class of hierarchical tree structure models to dissimilarity data and its application to some "body parts" data of Miller's. Proc., 81st Ann. Convent. Amer. Psychol. Assoc., 8:
CARROLL, J. D., AND S. PRUZANSKY. Discrete and hybrid scaling models. Pp , in Similarity and choice (E. D. Lantermann and H. Feger, eds.) Hans Huber: Bern, Switzerland, 392 pp.
CASE, S. M. Biochemical systematics of the members of the genus Rana native to western North America. Syst. Zool., 27:
CROVELLO, T. J. The effect of change of number of OTUs in a numerical taxonomic study. Brittonia, 20:
DUNCAN, T. 0., AND G. F. ESTABROOK. An operational method for evaluating classifications. Syst. Bot., 1:
DuPRAW, E. J. Non-Linnean taxonomy. Nature, 202:
DuPRAW, E. J. Non-Linnean taxonomy and the systematics of honeybees. Syst. Zool., 14:1-24.
ELDREDGE, N., AND J. CRACRAFT. Phylogenetic patterns and the evolutionary process. Columbia: New York, 349 pp.
ESTABROOK, G. F. Cladistic methodology: a discussion of the theoretical basis for the induction of evolutionary history. Ann. Rev. Ecol. Syst., 3:
FARRIS, J. S. The meaning of relationship and taxonomic procedure. Syst. Zool., 16:
FARRIS, J. S. Estimating phylogenetic trees from distance matrices. Amer. Nat., 106:
FARRIS, J. S. A probability model for inferring evolutionary trees. Syst. Zool., 22:
FARRIS, J. S. On the phenetic approach to vertebrate classification. In Hecht, M. K., P. C. Goody and B. M. Hecht (eds.), Major patterns in vertebrate evolution. Plenum, New York, pp
FARRIS, J. S. 1979a. On the naturalness of phylogenetic classification. Syst. Zool., 28:
FARRIS, J. S. 1979b. The information content of the phylogenetic system. Syst. Zool., 28:
FARRIS, J. S., A. G. KLUGE, AND M. J. ECKARDT. A numerical approach to phylogenetic classification. Syst. Zool., 19:
FELSENSTEIN, J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Zool., 22:
GOWER, J. C. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:
GOWER, J. C. Multivariate analysis and multidimensional geometry. Statistician, 17:
GOWER, J. C. Maximal predictive classification. Biometrics, 30:
HARARY, F. Graph theory. Addison-Wesley: Reading, Massachusetts, 274 pp.
HARTIGAN, J. A. Representation of similarity matrices by trees. J. Amer. Stat. Assoc., 62:
HARTIGAN, J. A. Clustering algorithms. Wiley, New York, 351 pp.
HO, C., E. M. PRAGER, A. C. WILSON, D. T. OSUGA, AND R. E. FEENEY. Penguin evolution: protein comparisons demonstrate phylogenetic relationships to flying aquatic birds. J. Mol. Evol., 8:
JANOWITZ, M. F. A note on phenetic and phylogenetic classifications. Syst. Zool., 28:
JARDINE, C. J., N. JARDINE, AND R. SIBSON. The structure and construction of taxonomic hierarchies. Math. Biosci., 1:
JARDINE, N., AND R. SIBSON. The construction of hierarchic and non-hierarchic classifications. Computer J., 11:
JARDINE, N., AND R. SIBSON. Mathematical taxonomy. Wiley, London, 286 pp.
JOHNSON, S. C. Hierarchical clustering schemes. Psychometrika, 32:
JOLLES, J., F. SCHOENTGEN, P. JOLLES, E. M. PRAGER, AND A. C. WILSON. Amino acid sequence and immunological properties of chachalaca egg white lysozyme. J. Mol. Evol., 8:
LAKOVAARA, S., A. SAURA, AND C. T. FALK. Genetic distance and evolutionary relationships in the Drosophila obscura species group. Evolution, 26:
LEQUESNE, W. J. Further studies based on the uniquely derived character concept. Syst. Zool., 21:
MAXSON, L. R., AND A. C. WILSON. Albumin evolution and organismal evolution in tree frogs (Hylidae). Syst. Zool., 24:1-15.
MICKEVICH, M. F. Taxonomic congruence. Syst. Zool., 27:
MICKEVICH, M. F. Taxonomic congruence: Rohlf and Sokal's misunderstanding. Syst. Zool., 29:
NELSON, G. Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson's Familles des Plantes ( ). Syst. Zool., 28:1-21.
PATRINOS, A. N., AND S. L. HAKIMI. The distance matrix of a graph and its tree realization. Quart. Appl. Math., 30:
PRAGER, E. M., D. P. FOWLER, AND A. C. WILSON. 1976a. Rates of evolution in conifers (Pinaceae). Evolution, 30:
PRAGER, E. M., A. C. WILSON, D. T. OSUGA, AND R. E. FEENEY. 1976b. Evolution of flightless birds on southern continents: transferrin comparison shows monophyletic origin of ratites. J. Mol. Evol., 8:

ROHLF, F. J. 1972. Empirical comparison of three ordination techniques in numerical taxonomy. Syst. Zool., 21:271-280.
ROHLF, F. J., AND R. R. SOKAL. Comments on taxonomic congruence. Syst. Zool., 29:
SARICH, V. M. 1969a. Pinniped origins and the rate of evolution of carnivore albumins. Syst. Zool., 18:
SARICH, V. M. 1969b. Pinniped phylogeny. Syst. Zool., 18:
SCHUH, R. T., AND J. S. FARRIS. Methods for investigating taxonomic congruence and their application to the Leptopodomorpha. Syst. Zool., 30:
SCHUH, R. T., AND J. T. POLHEMUS. Analysis of taxonomic congruence among morphological, ecological, and biogeographic data sets for the Leptopodomorpha (Hemiptera). Syst. Zool., 29:1-26.
SNEATH, P. H. A. The application of computers to taxonomy. J. Gen. Microbiol., 17:
SNEATH, P. H. A. Recent developments in theoretical and quantitative taxonomy. Syst. Zool., 10:
SNEATH, P. H. A., AND R. R. SOKAL. Numerical taxonomy. Freeman, San Francisco, 573 pp.
SOKAL, R. R., AND F. J. ROHLF. The comparison of dendrograms by objective methods. Taxon, 11:
SOKAL, R. R., AND F. J. ROHLF. Taxonomic congruence in the Leptopodomorpha re-examined. Syst. Zool., 30:
WALLACE, D. G., M. C. KING, AND A. C. WILSON. Albumin differences among ranid frogs: taxonomic and phylogenetic implications. Syst. Zool., 22:1-13.

Manuscript received September Revised July

APPENDIX A. ULTRAMETRICS, 4-POINT METRICS, AND CLASSIFICATIONS

Since the literature on this subject is scattered, we present below definitions of distances, ultrametrics, 4-point metrics, and classifications as well as some notes on their interrelationships. A familiarity with these is necessary for an understanding of this paper. Jardine and Sibson (1971) define a coefficient $\delta_{ij}$ to be a dissimilarity coefficient if it satisfies the following conditions (for all OTUs $i$ and $j$ in the set $P$ of OTUs).
$$\delta_{ij} \ge 0 \qquad \text{(C1)}$$
$$\delta_{ii} = 0 \qquad \text{(C2)}$$
$$\delta_{ij} = \delta_{ji} \qquad \text{(C3)}$$

Note that in the above context the term "dissimilarity" coefficient has no necessary relationship to general or overall similarity in phenetic taxonomy. It simply implies a coefficient in which small values indicate a close relationship and large values indicate more distant relationships. The biological meaning of the term "relationship" depends upon the definition of $\delta_{ij}$ employed in a particular application (it could be either phenetic or cladistic, for example). We are concerned here only with its general mathematical properties.

The coefficient is a distance coefficient if it also satisfies the triangle inequality condition (for all OTUs $i$, $j$, and $k$ in $P$):

$$\delta_{ij} \le \delta_{ik} + \delta_{jk} \qquad \text{(C4)}$$

In such cases one can visualize a space (called a pseudometric space in topology) in which the OTUs correspond to points and the distance between pairs of points $i$ and $j$ is given by $\delta_{ij}$. Such distances could be, for example, Euclidean or Manhattan distances. Sometimes it is desirable (for mathematical convenience) to prohibit the possibility that a pair of different OTUs could have a dissimilarity coefficient or a distance of zero. This can be ensured by the definiteness condition:

$$\text{if } i \ne j, \text{ then } \delta_{ij} > 0. \qquad \text{(C5)}$$

If C5 holds (in addition to the previous conditions), then a distance coefficient defines a metric space. The coefficient $\delta_{ij}$ is said to be ultrametric if it also satisfies the following condition (for all OTUs $i$, $j$, and $k$ in $P$):

$$\delta_{ij} \le \max\{\delta_{ik}, \delta_{jk}\} \qquad \text{(C6)}$$

This is obviously a more restrictive condition than C4, since $\max\{\delta_{ik}, \delta_{jk}\} \le \delta_{ik} + \delta_{jk}$. One can visualize ultrametric $\delta_{ij}$ values between three OTUs as the lengths of the sides of an isosceles triangle. In such a triangle two of the three distances (sides of the triangle) must be equal to one another and greater than or equal to the third distance. Thus if OTUs $i$ and $j$ are closest, then $\delta_{ik} = \delta_{jk}$.
Any side of such an isosceles triangle will thus be less than or equal to the length of the larger of the other two sides; hence it fulfills the ultrametric condition C6. The properties of ultrametrics as related to classifications have been pointed out independently by Hartigan (1967), Jardine et al. (1967), and Johnson (1967). If, however, condition C5 does not hold, then one has only a pseudo-ultrametric coefficient.

Let us now consider the correspondence between ultrametric coefficients and hierarchic nonoverlapping classification schemes, HCS (after Johnson, 1967). The following definitions are convenient. A hierarchic classification consists of a sequence of partitions going from the lowest, in which every OTU is placed in its own class or taxon (the disjoint partition), to the highest, in which all OTUs are placed into a single class (the conjoint partition). Each partition consists of k mutually exclusive (nonoverlapping) classes, $C_1, C_2, \ldots, C_k$. If the hierarchic classification is nonoverlapping (as in the usual Linnean classification with which we are concerned here), then every class in a given partition is either a class in the next lower partition or is the union of two or more classes in that partition. This is clearly the type of hierarchical classification visualized by most biological systematists (see, for example, Eldredge and Cracraft, 1980).

Associated with each partition is its level or rank. Although in a traditional Linnean classification the ranks have names, it is important to recognize that these ranks are ordered: species < genus < family < ..., etc. In numerical taxonomy numerical values are usually assigned, which yields an interval scale rather than a simpler ordinal scale of taxonomic ranks.

Hierarchic classifications can be represented by lists of the classes at each level. A dendrogram is an equivalent representation if it is drawn so that the OTUs at the tips join at successive levels to correspond to the classes at these levels. Another equivalent, but less obvious, representation can be made by defining a coefficient, $u_{ij}$, which measures the classificatory relationship between two OTUs i and j in a given classification as the rank of the lowest level taxon which includes both i and j. For example, if OTUs i and j belong to different genera but to the same family, then $u_{ij}$ equals the level "family." In a phenogram based on a dissimilarity scale, the level would be read off the dissimilarity axis at the point where clusters containing i and j first join. In a cladogram the level would be the height (a function of the number of furcation points from the base of the cladogram) of the most recent common ancestor of i and j.

Even with only rank-ordered classificatory levels it is possible to show that the classificatory relationship defined above satisfies the ultrametric conditions (see Figure 10A). If we define the level of the lowest level partition (the level at which each class contains only a single OTU) as zero, then condition C2 is satisfied by definition, and higher ranks must necessarily be nonnegative as required by condition C1.
Clearly the lowest level taxon to which i and j belong is the same as the lowest level taxon to which j and i belong, so that condition C3 is satisfied. Now consider any 3 OTUs i, j, and k. If the level of the lowest level taxon to which OTUs i and j belong is lower than that of i and k, then the lowest level taxon to which i and k belong is the same taxon as the lowest level taxon to which j and k belong (and hence $u_{ij} < u_{ik} = u_{jk}$ as required by C6). If, on the other hand, the level of the lowest level taxon to which OTUs i and j belong is higher than that of the lowest level taxon containing i and k or that containing j and k (Figures 10B and 10C), then $u_{ik} < u_{jk} = u_{ij}$ or $u_{jk} < u_{ik} = u_{ij}$, respectively. Thus C6 is satisfied in either case since $u_{ij}$ is equal to the larger of $u_{ik}$ or $u_{jk}$. If the classification is not sufficiently resolved, so that i, j, and k all belong to the same lowest level taxon, then C6 is also satisfied by $u_{ij} = u_{ik} = u_{jk}$.

Thus the classificatory relationship between pairs of OTUs in a hierarchic classification can be shown to be ultrametric, and conversely a set of ultrametric distances is equivalent to a hierarchic classification. There is a 1-to-1 relationship since either representation can be converted into the other with no loss of information. For some purposes one representation may be more convenient than another.

FIG. 10. Three possible dendrograms (panels A, B, and C) with three OTUs i, j, and k. See Appendix A for an explanation.

In place of condition C6 one could require the following 4-point condition (for all OTUs h, i, j, and k in $\mathcal{P}$):

$$\delta_{hi} + \delta_{jk} \le \max\{\delta_{hj} + \delta_{ik},\ \delta_{hk} + \delta_{ij}\} \qquad \text{(C7)}$$

This condition is satisfied by path length distances on a tree (in the graph-theoretical sense). Distance $\delta_{ij}$ is then the sum of the lengths of all the edges (internodes) on the path between OTUs i and j on the given tree. These 4-point conditions have been pointed out by Buneman (1971, 1974), Patrinos and Hakimi (1972), and others. Farris (1967) has called these path length distances "patristic" distances.
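The two constructions just described can be illustrated with a short Python sketch, which is not part of the original paper. It derives the classificatory coefficient from a small hierarchic classification and checks condition C6, and then checks condition C7 on the path lengths of a small tree. The function names (`cophenetic`, `satisfies_c6`, `satisfies_c7`), the dictionary representation keyed by OTU pairs, and both toy data sets (three OTUs in a genus/family hierarchy, and a four-leaf tree with assumed edge lengths) are hypothetical choices made for illustration only.

```python
from itertools import combinations_with_replacement, product


def cophenetic(partitions):
    """Classificatory coefficient from [(level, [set_of_OTUs, ...]), ...].

    Partitions are listed from the lowest (each OTU alone) to the highest;
    u[i, j] is the lowest level at which i and j fall in the same class.
    """
    otus = sorted(o for cls in partitions[0][1] for o in cls)
    u = {}
    for i, j in combinations_with_replacement(otus, 2):
        for level, classes in partitions:
            if any(i in cls and j in cls for cls in classes):
                u[i, j] = u[j, i] = level
                break
    return otus, u


def satisfies_c6(otus, d, tol=1e-12):
    """Ultrametric condition C6 for all triples of OTUs."""
    return all(d[i, j] <= max(d[i, k], d[j, k]) + tol
               for i, j, k in product(otus, repeat=3))


def satisfies_c7(otus, d, tol=1e-12):
    """4-point condition C7 for all quadruples of OTUs."""
    return all(d[h, i] + d[j, k] <= max(d[h, j] + d[i, k], d[h, k] + d[i, j]) + tol
               for h, i, j, k in product(otus, repeat=4))


# Toy classification: level 0 = single OTUs, 1 = "genus", 2 = "family".
partitions = [
    (0, [{"a"}, {"b"}, {"c"}]),
    (1, [{"a", "b"}, {"c"}]),
    (2, [{"a", "b", "c"}]),
]
otus, u = cophenetic(partitions)
print(u["a", "b"], u["a", "c"], u["b", "c"])
print(satisfies_c6(otus, u))

# Toy tree ((a:1, b:2), (c:3, d:1)) with an internal edge of length 2
# (edge lengths are assumed); entries are sums of edge lengths along paths.
leaves = ["a", "b", "c", "d"]
upper = {("a", "b"): 3, ("a", "c"): 6, ("a", "d"): 4,
         ("b", "c"): 7, ("b", "d"): 5, ("c", "d"): 4}
path = {(x, x): 0 for x in leaves}
for (x, y), v in upper.items():
    path[x, y] = path[y, x] = v
print(satisfies_c7(leaves, path), satisfies_c6(leaves, path))
```

In this toy example the coefficient derived from the classification satisfies C6, while the path lengths of the tree satisfy C7 without satisfying C6.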
Unless the lengths of the edges connecting each OTU to its adjacent internal node (HTU) are all equal, distances which satisfy condition C7 will not be ultrametric and hence not equivalent to a classification (Hartigan, 1975). However, 4-point distances, $p_{ij}$, can be converted into ultrametric distances by rooting the tree at some OTU (or HTU) k and defining the ultrametric values for all OTUs $i \ne j$ and arbitrary k in $\mathcal{P}$ as:

$$u_{ii} = 0, \qquad u_{ij} = p_{ij} - p_{ik} - p_{jk} + a,$$

where a is a sufficiently large constant to ensure that all the $u_{ij}$ are nonnegative. This relationship has been pointed out by Hartigan (1975) and Carroll and Pruzansky (1980). This transformation cannot be inverted, however, since the information on the distances from the reference OTU k (the root) to each OTU has been lost in the transformation to an ultrametric distance. Thus there is not a 1-to-1 relationship between path length distances and a conventional hierarchic classification.

As discussed in the section on Tests of Naturalness, Farris (1979b) employed a somewhat different transformation, although it was based on the same idea. The two formulations differ in that Farris (1979b) used a similarity scale for a classification whereas we have illustrated our discussion with a dissimilarity scale. But, more importantly, his special similarity transformation for distances (his equation 1) results in the diagonals of the transformed matrix (the self-similarities) being different for different OTUs (see our Table 1, matrix A1, for an example). Thus in his scheme $u_{ii} \ne u_{jj} \ne 0$. He is able to carry out the inverse transformation (his equation 3), converting the cophenetic values in matrix A'1 back into a matrix of path lengths, only by employing additional information: the values of the self-similarities. Clearly, if a dissimilarity matrix is not ultrametric it is not equivalent to a classification in the sense generally accepted by biological taxonomists.

APPENDIX B.
MATRIX CORRELATIONS FOR 10 DATA SETS

The tables presented below furnish matrix correlations for the remaining nine real data sets and the single artificial data set, laid out in a manner identical to that of Table 2 for the Xenopus data. An asterisk in the last column identifies a rejection of the null hypothesis DD''(i) = DD'(i) at P ≤ . Column headings are identical to those of Table 2. For explanation of symbols see text. Each table is headed by the name of the group of organisms it represents, the overall cophenetic correlation DD', and the critical 5% F-value for the significance tests. The results reported in these tables and in Table 2 are summarized in Table 3.

TABLE A1. RANA (WALLACE ET AL., 1973). DD' = , F.05[4,6] =
Columns: Reference OTU, DD'', AA', DD', F
D(B2) = ; DM(2D) =
Note: The values we present in this table are not consistent with those given in Farris (1979a); the latter appear to be in error.

TABLE A2. PINES (PRAGER ET AL., 1976a). DD' = , F.05[5,10] =
Columns: Reference OTU, DD'', AA', DD', F
D(B2) = ; DM(2D) =
