Molecular Phylogenetics and Evolution

Molecular Phylogenetics and Evolution 61 (2011) 177–191

Spurious 99% bootstrap and jackknife support for unsupported clades

Mark P. Simmons a,*, John V. Freudenstein b

a Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA
b The Ohio State University Herbarium, 1315 Kinnear Road, Columbus, OH 43212, USA

Article info

Article history: Received 20 February 2011; Revised 25 May 2011; Accepted 8 June 2011; Available online 16 June 2011

Keywords: Frequency-within-replicates bootstrap; Jackknife; Majority rule consensus; Missing data; Undersampling-within-replicates artifact; Unsupported clade

Abstract

Quantifying branch support using the bootstrap and/or jackknife is generally considered to be an essential component of rigorous parsimony and maximum likelihood phylogenetic analyses. Previous authors have described how application of the frequency-within-replicates approach to treating multiple equally optimal trees found in a given bootstrap pseudoreplicate can provide apparent support for otherwise unsupported clades. We demonstrate how a similar problem may occur when a non-representative subset of equally optimal trees is held per pseudoreplicate, which we term the undersampling-within-replicates artifact. We illustrate the frequency-within-replicates and undersampling-within-replicates bootstrap and jackknife artifacts using both contrived and empirical examples, demonstrate that the artifacts can occur in both parsimony and likelihood analyses, and show that the artifacts occur in outputs from multiple different phylogenetic-inference programs. Based on our results, we make the following five recommendations, which are particularly relevant to supermatrix analyses, but apply to all phylogenetic analyses.
First, when two or more optimal trees are found in a given pseudoreplicate they should be summarized using the strict-consensus rather than frequency-within-replicates approach. Second, jackknife resampling should be used rather than bootstrap resampling. Third, multiple tree searches while holding multiple trees per search should be conducted in each pseudoreplicate rather than conducting only a single search and holding only a single tree. Fourth, branches with a minimum possible optimized length of zero should be collapsed within each tree search rather than collapsing branches only if their maximum possible optimized length is zero. Fifth, resampling values should be mapped onto the strict consensus of all optimal trees found rather than simply presenting the ≥50% bootstrap or jackknife tree or mapping the resampling values onto a single optimal tree.

© 2011 Elsevier Inc. All rights reserved. 1055-7903/$ - see front matter. doi:10.1016/j.ympev.2011.06.003

* Corresponding author. Fax: +1 970 491 0649. E-mail address: psimmons@lamar.colostate.edu (M.P. Simmons).

1. Introduction

The effect of missing data on phylogenetic analyses has been studied with respect to their effects on tree construction, polymorphic taxon coding, and assessment of homoplasy (e.g., Nixon and Davis, 1991; Platnick et al., 1991). Wiens (2003) demonstrated that it is not just the amount of missing data in a matrix, but rather their arrangement, that determines their effect on phylogenetic resolution and accuracy. Much less investigation has been focused on the effects of missing data on resampling support analyses. Wilkinson (2003) observed that wildcard terminals (Nixon and Wheeler, 1991), containing many missing values, can lower resampling support values for many clades on the inferred tree. But wildcard terminals can also raise resampling support values, sometimes dramatically, including those for clades that are unsupported by the data.

The vast majority of parsimony- (Farris, 1970; Fitch, 1971) and likelihood-based (Felsenstein, 1973) phylogenetic analyses quantify branch support using the bootstrap (BS; Felsenstein, 1985) or jackknife (JK; Farris et al., 1996). Although interpretations of BS/JK values vary (e.g., Felsenstein, 1985; Hillis and Bull, 1993; Sanderson, 1995), use of one of these resampling procedures is generally considered to be an essential component of rigorous parsimony and likelihood phylogenetic analyses (e.g., Page and Holmes, 1998; Graur and Li, 2000; Felsenstein, 2004).

BS/JK support is typically presented on a majority-rule consensus (MRC), which was introduced by Margush and McMorris (1981). In addition to being a way of presenting resampling support values, the MRC is sometimes used in the same manner in which it was originally proposed. Margush and McMorris (1981) introduced MRC not as a method to summarize BS/JK support, but rather to summarize equally optimal trees. The sole justification given by Margush and McMorris (1981, p. 242) for using MRCs to summarize equally optimal trees was that, "We feel that many requirements of a consensus are met by what has been called majority rule in the social sciences." The use of MRCs to summarize equally optimal phylogenetic trees has since been criticized by Barrett et al. (1991), Nixon and Carpenter (1996), Sharkey and Leathers (2001), and Sumrall et al. (2001). Like Barrett et al.
(1991, p. 487), "...we know of no general justification for this approach." Note that this criticism of MRCs does not apply to their use in summarizing BS/JK support from multiple pseudoreplicates. Both Sharkey and Leathers (2001) and Sumrall et al. (2001) demonstrated that ambiguity is determinate to MRCs, such that more ambiguity in one part of a tree leads to greater apparent support for clades in other parts of the tree. As Sharkey and Leathers (2001, p. 282) stated, "This preference [of MRCs] is based on the implicit assumption that all fundamental cladograms are independent and equally likely to be the correct tree. This assumption is unfounded." Sumrall et al. (2001, p. 256) further clarified and showed that: "A combination of two factors - a labile terminal taxon and unequal numbers of optimal trees consistent with each of its regional placements - is hypothesized to cause bias in majority-rule consensus trees. This bias effectively favors the least resolved alternate regional island of parsimony." Sumrall et al. (2001) concluded by noting that the negative implications of using MRC extend to resampling measures, as in the BS and the JK, when MRC-like assumptions are used to summarize multiple equally optimal trees found within a given pseudoreplicate.

Table 1
Data matrix of binary characters for part A of the third contrived example.

1A        1  111111  111111  00000  1000000000
2A        1  111111  111111  00000  0100000000
3A        1  111111  111111  00000  0010000000
4A        1  111111  111111  00000  0001000000
5A        1  111111  111111  00000  0000100000
6A        1  111111  111111  00000  0000010000
7A        1  111111  111111  00000  0000001000
8A        1  111111  111111  00000  0000000100
9A        1  111111  111111  00000  0000000010
10A       1  111111  111111  00000  0000000001
1B        0  000000  000000  00000  0000000000
2B        0  000000  000000  10000  0000000000
3B        0  111111  000000  01000  0000000000
4B        0  111111  111111  00100  0000000000
5B        0  111111  111111  00010  0000000000
6B        0  111111  111111  00001  0000000000
Wildcard  0  ??????  ??????  11111  1111111111

Felsenstein (1985) did not specifically address how to handle multiple equally optimal trees when he introduced the BS to phylogenetics (see also De Laet et al., 2004). But Felsenstein (2004, p. 339) clarified his position: "Some methods may give us more than one estimate of the phylogeny...in such cases we can consider that if 10 tied estimates are found for one bootstrap replicate, we consider each to be one-tenth of a tree, so that the results from that bootstrap replicate are not overemphasized when the trees are combined." Although this approach does not overemphasize such a BS pseudoreplicate, assigning a score of 0.1 to each of the 10 equally optimal trees does overemphasize each of the clades that is not present in all 10 of those trees. That is to say that this approach assigns support to a clade(s) that is not present in the strict consensus. As noted by Goloboff et al. (2003), any clade present in all optimal trees is by definition supported by the data (irrespective of how well supported the clade may be as measured by a branch-support criterion); any clade present in one or more, but not all, optimal trees is unsupported; and any clade that is not present in any of the optimal trees is contradicted. We follow Goloboff et al.'s (2003) definitions throughout this paper.

Felsenstein's (2004) approach to summarizing equally optimal trees found in a given pseudoreplicate (termed the frequency-within-replicates bootstrap [FWR BS] by Soreng and Davis, 1998) has been explicitly incorporated into both PHYLIP (Felsenstein, 2009) and PAUP* (Swofford, 2001), whereas NONA (Goloboff, 1999a,b) and TNT (Goloboff et al., 2008) explicitly implement the strict-consensus bootstrap (SC BS), in which only clades resolved in the strict consensus (Schuh and Polhemus, 1980) of all optimal trees found within each pseudoreplicate are considered (i.e., 1 if present, 0 if absent).
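The difference between the two scoring rules can be made concrete with a minimal sketch. It assumes each tied tree is represented simply as a set of clades (each clade a frozenset of terminal names); the function names and the three-tree toy pseudoreplicate are hypothetical illustrations, not code from any of the programs cited above.

```python
from collections import Counter
from fractions import Fraction

def fwr_scores(trees):
    """Frequency-within-replicates: each of k tied trees counts as 1/k of a tree."""
    k = len(trees)
    scores = Counter()
    for clades in trees:
        for clade in clades:
            scores[clade] += Fraction(1, k)
    return dict(scores)

def sc_scores(trees):
    """Strict consensus: a clade scores 1 only if it occurs in every tied tree."""
    shared = set.intersection(*(set(clades) for clades in trees))
    return {clade: 1 for clade in shared}

# Toy pseudoreplicate: a wildcard W joins A and B in three equally optimal ways.
AB, AW, BW = frozenset("AB"), frozenset("AW"), frozenset("BW")
ABW = frozenset("ABW")
tied_trees = [{ABW, AB}, {ABW, AW}, {ABW, BW}]

fwr = fwr_scores(tied_trees)   # AB, AW and BW each receive 1/3; ABW receives 1
sc = sc_scores(tied_trees)     # only ABW is scored
```

Under the FWR rule the unsupported clade (A, B) accrues one third of a tree from this pseudoreplicate, whereas under the SC rule it accrues nothing.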
The SC BS generally provides lower support than the FWR BS when multiple trees are held for each pseudoreplicate (Davis et al., 1998; Soreng and Davis, 1998) and is preferred because it does not rely on the unjustified MRC-like approach that inflates inferred support (Davis et al., 2004; Freudenstein and Davis, 2010). Inflated support for clades that are present in the SC of all equally optimal trees, as described by Davis and colleagues, is not the only problem that can be caused by use of the FWR BS. De Laet et al. (2004, p. 590) stated in a meeting abstract that, "... this method is defective in that it can yield high resampling frequencies for groups that are unsupported by the data, and this can occur in both parsimony and likelihood analyses." Goloboff and Pol (2005, p. 152) provided a contrived example of how this can occur by using a wildcard terminal consisting entirely of missing data in an otherwise well supported (four uncontradicted synapomorphies per clade) pectinate tree. FWR BS support in the example is dependent upon clade size (ranging from 50% for an 11-terminal clade to 96% for a two-terminal clade), showing the same artifact as Bayesian MCMC (Yang and Rannala, 1997) posterior probabilities (see also Pickett and Randle, 2005). The FWR BS artifact that was first suggested by Sumrall et al. (2001), and later clarified and demonstrated by De Laet et al. (2004) and Goloboff and Pol (2005), is expected to occur in data matrices with one or more wildcard terminals.
Terminals may behave as wildcards when they are unambiguously scored for only a few parsimony-informative characters (as in incompletely preserved fossils; Nixon and Wheeler, 1991), when they are scored as autapomorphies for most parsimony-informative characters (because the autapomorphies behave identically to missing data and inapplicable entries (i.e., as with gaps in nucleotide characters; Simmons and Ochoterena, 2000) in a parsimony context), and when there is extreme character conflict caused by convergence or reversal between divergent terminals (e.g., characters 14–28 in Table 1). Two types of empirical data matrices in which wildcard terminals may be expected to occur are those for which a large number of terminals are sampled relative to the number of parsimony-informative characters (e.g., Källersjö et al. (1998) sampled rbcL for 2538 plants; Tehler et al. (2003) sampled 18S rDNA for 1551 fungi) and supermatrix analyses with high amounts of missing data and low overlap in loci sampled among terminals (e.g., McMahon and Sanderson (2006) sampled 2228 Papilionoideae legumes for 33,168 nucleotide characters with only 4.3% of the cells scored as nucleotides; see supplemental online data matrices posted at: http://www.biology.colostate.edu/research/ for examples of the low overlap in loci sampled among closely related terminals).

In addition to the FWR BS/JK artifact providing apparent support for otherwise unsupported clades, the same type of problem may occur when a non-representative subset of equally optimal trees is held per BS/JK pseudoreplicate. By "non-representative subset" we mean that the SC of the optimal trees sampled is more resolved than the SC of all equally optimal trees and is resolved such that at least one unsupported clade is more likely to be resolved than by random chance alone.
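The first half of this definition, that the SC of the held trees is more resolved than the SC of all optimal trees, can be checked mechanically. The following is a minimal sketch under the assumption that trees are represented as sets of clades; the helper names are hypothetical, and the sketch does not test the second half of the definition (whether an unsupported clade is resolved more often than expected by chance).

```python
def strict_consensus(trees):
    """Clades present in every tree; each tree is a set of frozenset clades."""
    return frozenset(set.intersection(*(set(t) for t in trees)))

def spurious_clades(held, all_optimal):
    """Clades resolved by the SC of the held trees but not by the SC of all optimal trees."""
    return strict_consensus(held) - strict_consensus(all_optimal)

# Toy case: wildcard W can be placed in three equally optimal ways within (A, B, W).
AB, AW, BW = frozenset("AB"), frozenset("AW"), frozenset("BW")
ABW = frozenset("ABW")
t1, t2, t3 = {ABW, AB}, {ABW, AW}, {ABW, BW}
all_optimal = [t1, t2, t3]

only_one_held = spurious_clades([t1], all_optimal)      # contains the unsupported clade (A, B)
two_held = spurious_clades([t1, t2], all_optimal)       # empty: conflict collapses (A, B)
```

Holding a single tree always yields a fully resolved "consensus", so any unsupported clade on that tree is reported as resolved for the pseudoreplicate.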
By definition, a single hill-climbing heuristic search (as in nearest-neighbor interchange [NNI], subtree pruning regrafting [SPR], and tree bisection reconnection [TBR]) is only capable of sampling a single island of equally optimal trees (Maddison, 1991). If only a subset of the islands of equally optimal trees is sampled, the SC may resolve unsupported clades (Maddison, 1991). Yet it is not necessary to sample every tree on every island to obtain a properly resolved SC (Goloboff, 1999a,b). Weakly supported clades are generally not expected to survive BS/JK resampling (Farris et al., 1996). But even unsupported clades can survive resampling if the trees sampled from most pseudoreplicates are a non-representative subset, which we refer to as the undersampling-within-replicates BS/JK artifact and demonstrate below.

One difference between the undersampling-within-replicates and FWR artifacts is that the former does not apply when all equally optimal trees are held in each pseudoreplicate, whereas the latter applies whenever multiple equally optimal trees are held. A second difference is that the undersampling-within-replicates artifact is applicable to pseudoreplicates in which only a single optimal tree is held irrespective of how many optimal trees there are (if more than one), whereas the FWR artifact does not apply unless two or more equally optimal trees are held.

In this manuscript we illustrate the FWR and undersampling-within-replicates BS/JK artifacts using both contrived and empirical examples, demonstrate that the artifacts can occur in both parsimony and likelihood analyses, show that the artifacts occur in outputs from multiple different phylogenetic-inference programs, and infer in which types of data matrices the artifacts are most likely to occur. We conclude by making recommendations on how to minimize the number and severity of occurrences of the artifacts.

2. Methods

2.1. Contrived examples

Four contrived examples were created to demonstrate the artifacts. All examples consist entirely of binary parsimony-informative characters. The first example consists of 102 terminals and 112 characters.
With the exception of the wildcard terminal, which is scored as missing data for the following 12 characters, six uncontradicted synapomorphies support each of the following two clades: (1, 2) and (1, 2, 3). The remaining 100 characters alternately unite terminals 2–101 with the wildcard terminal. There is very strong support for both clade (1, 2) and (1, 2, 3) relative to terminals 4–101, yet there is no support for those two clades excluding the wildcard terminal. Because of the wildcard terminal, the SC is entirely unresolved. This first example was created to demonstrate how high the inferred support can be for unsupported clades.

The second example is identical to the first example, except that it consists of only 11 terminals and 21 characters. Because of the wildcard terminal, the SC is entirely unresolved. This example was created to be more computationally tractable than the first example and also demonstrate the importance of collapsing branches with a minimum possible length of zero (see below under Section 2.3).

The third example consists of two parts, A and B. The first character of part A unites terminals 1B–6B and the wildcard terminal separately from terminals 1A–10A (Table 1) in the SC. With the exception of the wildcard terminal, which is scored as missing data for characters 2–13, six uncontradicted synapomorphies support each of the following two clades: (1B, 2B) and (1B, 2B, 3B). The remaining 15 characters alternately unite terminals 1A–10A and 2B–6B with the wildcard terminal. Part B consists of only the last seven terminals of part A. This third example was created to demonstrate that a localized wildcard terminal in the SC can behave as a global wildcard in some BS and JK pseudoreplicates, thereby making the artifacts more severe than may otherwise be expected.

The fourth example consists of 14 terminals and 60 characters (Table 2). Terminals 7 and 8 are identical to each other and are the only terminals without any missing data.
Six uncontradicted synapomorphies are provided for each of 10 clades (or 12 synapomorphies for each of five clades, depending on how the tree is resolved), but because of the missing data, the SC is entirely unresolved. This example was created to demonstrate how low character overlap between sampled terminals in supermatrix analyses can create the artifacts.

2.2. Empirical data

Four empirical data sets were included in this study, two of which sampled a single locus for all terminals (Bailey et al., 2006; Richardson et al., 2006), and two of which sampled many loci but have substantial missing data (McMahon and Sanderson, 2006; Thomson and Shaffer, 2010; Table 3). All four data sets produced trees that have one or more large polytomies and many weakly supported clades. Each of the four data sets was split into two (Thomson and Shaffer, 2010) or six (all others) sub-matrices to enable relatively thorough most parsimonious, bootstrap, and jackknife tree searches to be computationally tractable in PAUP*.

The Bailey et al. (2006) data set used consists of 618 unique rDNA internal transcribed spacer (ITS) sequences (after deletion of 128 duplicates) sampled from the Brassicaceae (mustards). The data matrix comprises 721 nucleotide characters, of which 451 are parsimony informative. The 618 terminals were split into six ± natural groups (to the degree possible, based on the tree presented in Fig. S1 of Bailey et al. (2006)) of similar size (Table 3).

The Richardson et al. (2006) data set used consists of 511 unique kinesin-superfamily-motor-domain sequences (after deletion of 18 duplicates) sampled from 19 model species across eukaryotes. The data matrix comprises 2584 amino acid characters, of which 834 are parsimony informative. The 511 terminals were split into six ± natural groups (to the degree possible, based on the tree presented in Figs. 2–10 of Richardson et al. (2006)) of similar size (Table 3).
Table 2
Data matrix of binary characters for the fourth contrived example.

1   000000 000000 000000 000000 000000 ?????? ?????? ?????? ?????? ??????
2   000000 000000 000000 000000 000000 ?????? ?????? ?????? ?????? ??????
3   111111 000000 000000 000000 000000 ?????? ?????? ?????? ?????? ??????
4   111111 111111 000000 000000 000000 ?????? ?????? ?????? ?????? ??????
5   111111 111111 111111 000000 000000 ?????? ?????? ?????? ?????? ??????
6   111111 111111 111111 111111 000000 ?????? ?????? ?????? ?????? ??????
7   111111 111111 111111 111111 111111 000000 000000 000000 000000 000000
8   111111 111111 111111 111111 111111 000000 000000 000000 000000 000000
9   ?????? ?????? ?????? ?????? ?????? 111111 000000 000000 000000 000000
10  ?????? ?????? ?????? ?????? ?????? 111111 111111 000000 000000 000000
11  ?????? ?????? ?????? ?????? ?????? 111111 111111 111111 000000 000000
12  ?????? ?????? ?????? ?????? ?????? 111111 111111 111111 111111 000000
13  ?????? ?????? ?????? ?????? ?????? 111111 111111 111111 111111 111111
14  ?????? ?????? ?????? ?????? ?????? 111111 111111 111111 111111 111111
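The wildcard design of the first two contrived examples (Section 2.1) is regular enough to be generated programmatically. The sketch below is an illustration only: the function names, the choice of "1" as the shared state, and the character ordering are assumptions, and the authors' actual matrices are those posted as supplemental data. The summary statistics it produces can be compared with the missing-data columns of Table 3.

```python
def wildcard_matrix(n_regular):
    """Build rows for terminals 1..n_regular plus one wildcard terminal.

    Characters 1-6 unite (1, 2); characters 7-12 unite (1, 2, 3); each
    remaining character unites one of terminals 2..n_regular with the wildcard.
    """
    rows = {}
    for t in range(1, n_regular + 1):
        clade_12 = ("1" if t <= 2 else "0") * 6
        clade_123 = ("1" if t <= 3 else "0") * 6
        pairing = "".join("1" if t == u else "0" for u in range(2, n_regular + 1))
        rows[str(t)] = clade_12 + clade_123 + pairing
    # The wildcard is missing for the 12 clade characters and shares every pairing character.
    rows["Wildcard"] = "?" * 12 + "1" * (n_regular - 1)
    return rows

def missing_stats(rows):
    """Return (% missing overall, maximum % missing for a single terminal)."""
    n_chars = len(next(iter(rows.values())))
    per_row = [row.count("?") for row in rows.values()]
    overall = 100 * sum(per_row) / (len(rows) * n_chars)
    worst = 100 * max(per_row) / n_chars
    return round(overall, 1), round(worst, 1)

example_1 = wildcard_matrix(101)   # 102 terminals, 112 characters
example_2 = wildcard_matrix(10)    # 11 terminals, 21 characters
```

With these assumptions, `missing_stats` gives 0.1% and 10.7% for the first example and 5.2% and 57.1% for the second, matching the corresponding entries of Table 3.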

Table 3
Properties of data matrices sampled.

Matrix | # Terminals | # Pars. inf. chars | % Missing/inapplicable for pars. inf. chars | Maximum % missing/inapplicable for single terminal | # Clades on strict consensus | # Additional clades on majority-rule consensus | # Problem clades a
Contrived 1 | 102 | 112 | 0.1 | 10.7 | 0 | 2 | 2
Contrived 2A | 11 | 21 | 5.2 | 57.1 | 0 | 2 | 2
Contrived 2B | 11 | 21 | 5.2 | 57.1 | 0 | 2 | 2
Contrived 3A | 17 | 28 | 2.5 | 42.9 | 1 | 2 | 2
Contrived 3B | 7 | 17 | 10.1 | 70.6 | 0 | 2 | 2
Contrived 4 | 14 | 60 | 43.0 | 50.0 | 0 | 1 | 1 + 10
Bailey et al. 1 | 89 | 310 | 11.4 | 21.6 | 60 | 12 | 2
Bailey et al. 2 | 95 | 93 | 1.3 | 6.5 | 30 | 8 | 1
Bailey et al. 3 | 97 | 282 | 15.6 | 18.8 | 76 | 4 | 3
Bailey et al. 4 | 134 | 266 | 6.8 | 31.6 | 89 | 16 | 1
Bailey et al. 5 | 84 | 190 | 8.4 | 15.8 | 57 | 6 | 0
Bailey et al. 6 | 119 | 205 | 6.6 | 21.5 | 68 | 7 | 0
Richardson 1 | 78 | 390 | 22.1 | 40.3 | 70 | 4 | 2
Richardson 2 | 93 | 394 | 24.2 | 42.1 | 57 | 32 | 4
Richardson 3 | 72 | 442 | 25.3 | 48.9 | 54 | 13 | 1
Richardson 4 | 86 | 396 | 24.6 | 40.7 | 68 | 15 | 2
Richardson 5 | 68 | 395 | 20.8 | 35.4 | 63 | 2 | 1
Richardson 6 | 114 | 389 | 21.9 | 31.1 | 88 | 23 | 2
McMahon 1 | 97 | 526 | 45.1 | 96.0 | 0 | 72 | 46
McMahon 2 | 101 | 1250 | 60.9 | 96.2 | 11 | 63 | 54
McMahon 3 | 100 | 152 | 20.7 | 97.4 | 37 | 11 | 4
McMahon 4 | 104 | 1079 | 71.4 | 99.6 | 0 | 45 | 12
McMahon 5 | 113 | 991 | 37.2 | 100 | 0 | 66 | 52
McMahon 6 | 96 | 823 | 75.6 | 90.5 | 3 | 56 | 26
Thomson 213, 1 | 108 | 3311 | 69.1 | 99.3 | 72 | 20 | 15
Thomson 213, 2 | 105 | 1961 | 48.8 | 92.2 | 99 | 0 | N/A
Thomson 223, 1 | 114 | 3367 | 70.2 | 99.7 | 8 | 76 | 67
Thomson 223, 2 | 109 | 1981 | 50.3 | 96.2 | 100 | 2 | 2

a Defined as clades in the MRC but not the SC showing ≥50% BS and/or JK support in any of the parsimony-based PAUP* or TNT results.

The McMahon and Sanderson (2006) data set used consists of the dense supermatrix of 2228 terminals sampled from the Papilionoideae legumes. Of the 33,168 nucleotide characters, 7199 are parsimony informative. The data matrix consists of 89% missing data, 6.7% gapped positions (no gap characters were coded), and only 4.3% nucleotides. Six convex groups (monophyletic or paraphyletic) as resolved on the SC of 5000 equally parsimonious trees downloaded from http://loco.biosci.arizona.edu/sandlab/www_data/pubdata.htm, without any overlapping terminals, were sub-sampled from the original matrix.
Unlike the other convex groups, the third matrix was largely homogeneous with respect to taxon sampling (all terminals from Astragalus) and character sampling (all but three terminals were sampled for both ITS1 and ITS2 of rDNA; one terminal was only sampled for ITS1 and one terminal was only sampled for ITS2; the entire data matrix consisted of only 20.7% missing or inapplicable entries; Table 3). One duplicate sequence was deleted from the fifth matrix.

Two alternative data sets were used from Thomson and Shaffer's (2010) supermatrix of 53,406 nucleotide characters (of which 4440 are parsimony informative in the 213-terminal matrix and 4511 are parsimony informative in the 223-terminal matrix) sampled from turtles. The inferred phylogeny from the 213-terminal matrix was presented in Thomson and Shaffer's (2010) Fig. 5, after removal of 10 wildcard (or "rogue") terminals. Both the 213- and 223-terminal matrices were split into two sub-matrices based on which taxa are present on pages 53 and 54 of Thomson and Shaffer (2010). The 10 wildcard terminals were assigned to these two sub-matrices based on current turtle taxonomy.

All data sets analyzed for this study, both contrived and empirical, are posted as supplemental online data at http://www.biology.colostate.edu/research/.

2.3. Parsimony tree searches

With the exception of contrived example 2B, all PAUP* ver. 4b10 and TNT ver. 1.1 searches were performed such that branches with a minimum possible optimized length of zero would be collapsed following Davis et al. (2005) and Freudenstein and Davis (2010). Failure to collapse these branches can lead to a proliferation of equally optimal trees with branches that are only supported under some character-state optimizations, particularly in the context of missing data (Kitching et al., 1998; Kearney and Clark, 2003).
Contrived example 2B was run using the default setting in PAUP* (collapse branches only if their maximum length is zero) and by deactivating the collapse function in TNT.

For contrived examples 2–4, searches for the most parsimonious trees were performed in PAUP* using branch and bound with all most parsimonious trees being held, after which the SC and MRC were calculated. BS and JK analyses for contrived examples 2, 3B, and 4 were performed using 10,000 replicates. Following Farris et al. (1996), the JK deletion probability was set to 0.367879 and "jac" resampling was emulated. The following three approaches were used for calculating BS and JK support: (1) branch-and-bound searches while holding a single optimal tree for each pseudoreplicate, (2) 100 random-addition-sequence (RAS) searches employing TBR swapping with up to 10 optimal trees held per RAS search (hence up to 1000 trees held per pseudoreplicate), and (3) branch-and-bound searches while holding all optimal trees for each pseudoreplicate. In all BS and JK analyses performed for this study, only clades with ≥50% support were considered.

Because of time limitations when attempting branch-and-bound searches, BS and JK support for contrived example 3A were calculated using the following three approaches: (1) one RAS search with TBR swapping and only a single tree held for each of the 2500 pseudoreplicates, (2) 100 RAS searches employing TBR swapping with up to 10 trees held per TBR search for each of the 10,000 pseudoreplicates, and (3) 10 RAS searches employing TBR swapping with up to 10,000 trees held per RAS search (hence up to 100,000 trees held per pseudoreplicate).

For contrived example 1 and all empirical matrices, searches for the most parsimonious trees were performed in PAUP* using 1000 RAS searches employing TBR swapping and up to 10,000 trees held per RAS search (hence up to 10 million trees held), after which the SC and MRC were calculated. BS and JK analyses were generally performed using 2500 pseudoreplicates. The following four approaches were used for calculating BS and JK support: (1) one RAS search with TBR swapping and only a single tree held, (2) 10 RAS searches with TBR swapping and only a single tree held per RAS search (hence up to 10 trees held per pseudoreplicate), (3) 10 RAS searches with TBR swapping and up to 10 trees held per RAS search (hence up to 100 trees held per pseudoreplicate), and (4) 10 RAS searches with TBR swapping and up to 10,000 trees held per RAS search (hence up to 100,000 trees held per pseudoreplicate). Because of memory and/or speed limitations, fewer than 2500 BS and/or JK pseudoreplicates were used for 14 of the empirical matrices (from Bailey et al. (2006), McMahon and Sanderson (2006), and Thomson and Shaffer (2010)) when performing 10 RAS searches with TBR swapping and up to 10,000 trees held per RAS search.
The number of pseudoreplicates actually used in these cases ranged from 23 to 1812 with a median of only 55. Because of these low numbers of pseudoreplicates, a high degree of error is expected relative to the other BS and JK analyses performed. These 14 abbreviated PAUP* analyses constitute just 8% of all PAUP* BS and JK analyses performed. Greater than ±1% accuracy is expected for all BS analyses using 10,000 pseudoreplicates, and greater than ±2% accuracy is expected for all BS analyses using 2500 replicates, for BS support ≥50% (Hedges, 1992).

Identical BS and JK searches to those performed in PAUP* were performed in TNT, with the exception that the JK deletion probability was set to 0.37 because TNT does not allow it to be set beyond the nearest hundredth. Because of the far greater speed relative to PAUP*, all TNT searches were run to completion.

To the degree possible, the four sets of contrived data sets were also run in DAMBE (Xia and Xie, 2001), MEGA (Tamura et al., 2007), PHYLIP (Felsenstein, 2009), POY4 (Varón et al., 2010), and SeaView (Gouy et al., 2010). The 0 and 1 character states were converted to adenines and thymines for the DAMBE, MEGA, and SeaView analyses.

DAMBE ver. 5.2.18 was used to perform both BS and 50%-deletion JK analyses (the probability of deletion cannot be varied for JK analyses). There are no options to control the quality of tree-search or the number of trees held per random-addition-order, BS, or JK pseudoreplicate. Because of memory limitations, only 1000 pseudoreplicates, each consisting of 10 RAS searches, were performed. Because DAMBE eliminates any characters with missing data, "?" was changed to autapomorphic cytosines for contrived examples 1–3. Because DAMBE is limited to nucleotide character states and would treat Ns as missing data, it was impossible to run contrived matrix 4.
Presumably because of limitations on the number of terminals, DAMBE was unable to complete even a single search for the most parsimonious trees on contrived example 1. As such, the DAMBE results are limited to contrived examples 2 and 3.

MEGA ver. 4.1b3 was used to perform BS analyses. Tree searches were performed using all characters rather than the default setting of excluding those characters with gaps or missing data. There are no options to control the number of trees held per RAS replicate or BS pseudoreplicate. Because of memory and speed limitations, 10,000 BS pseudoreplicates, each consisting of 100 RAS searches, were performed using close-neighbor-interchange with the intermediate search level (two) for contrived examples 2–4. Because of time limitations, contrived example 1 was analyzed using 200 BS pseudoreplicates, each consisting of five RAS searches using close-neighbor-interchange with the lowest search level (one).

The PHYLIP ver. 3.69 program package was used to perform BS and JK analyses. Matrix weight files were constructed with SEQBOOT and these were used to perform the parsimony analyses with PARS. 10,000 pseudoreplicates were run in all cases. Within PARS, five RAS searches with the more thorough branch swapping option were applied in each pseudoreplicate. One, 1000, or 10,000 trees were held per pseudoreplicate. The JK deletion fraction was set at 37%. MRCs with resampling values were constructed with CONSENSE.

POY4 ver. 4.1.2 was used to perform BS and JK analyses. Data were run as morphological characters with a static matrix. 2500 pseudoreplicates were performed, using 10 RAS searches with TBR branch swapping, and saving one, 1000, or 10,000 trees per RAS search.

SeaView ver. 4.2.6 and 4.2.10 were used to perform BS analyses through dnapars from PHYLIP ver. 3.52. All gapped characters were included in the analyses and gaps were treated as unknown states rather than as a separate character state.
There are no options to control the quality of tree-searches. Because of memory and speed limitations, only 1000 BS pseudoreplicates were performed for most analyses (200 and 150 pseudoreplicates were performed for the first contrived example when holding 10 and 10,000 trees per search, respectively). It is unclear whether the number of optimal trees retained is set per RAS search (as in PAUP*) or BS pseudoreplicate, but this was alternatively set to one RAS search with one tree held, 100 RAS searches with 10 trees held, or 100 RAS searches with 10,000 trees held. Together with M. Gouy (pers. comm., 2010), we found a bug in how SeaView ver. 4.2.6 handles multiple equally parsimonious trees within a given pseudoreplicate, which M. Gouy corrected in ver. 4.2.10. Only the results from ver. 4.2.10 are reported here.

2.4. Likelihood tree searches

To demonstrate that the FWR and undersampling-within-replicates artifacts are not limited to parsimony-based analyses, the second contrived example was also analyzed using likelihood in PAUP* and RAxML ver. 7.0.3 (Stamatakis, 2006). The second contrived example was modified from 0/1 character states into nucleotide character states, with the same character-state distribution among terminals for each character being maintained (Table 4). All 12 types of nucleotide substitutions are represented among the 21 characters. In addition to analyzing the data matrix of 11 terminals coded for 21 characters, a second data matrix was created by adding 1000 invariant characters with 250 characters for each of the four nucleotides. So as to perfectly fit the data, the Felsenstein (1981) model (F81) was applied to the 21-character matrix in PAUP* by selecting Nst = 1 and the default setting of BaseFreq = Empirical. The F81 model with invariant sites (Reeves, 1992) set to pinvar = 0.9794 (i.e., 1000/1021) was applied to the 1021-character matrix.
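The construction of the 1021-character matrix and the resulting proportion of invariant sites can be reproduced with a short sketch. The helper name is hypothetical, only two of the 11 rows of Table 4 are shown, and scoring the wildcard terminal for the invariant characters is an assumption, since the text does not state how that terminal was treated.

```python
def add_invariant_sites(rows, per_base=250):
    """Append a block of invariant sites (per_base copies of each nucleotide) to every row."""
    block = "".join(base * per_base for base in "ACGT")
    return {name: seq + block for name, seq in rows.items()}

# Two rows of Table 4 with the spaces removed (21 characters each).
variable = {
    "1": "ATGCATTGCAGCATGCATGCG",
    "Wildcard": "????????????TGCAGCTGA",
}

extended = add_invariant_sites(variable)
n_variable = len(variable["1"])
n_total = len(extended["1"])
pinvar = (n_total - n_variable) / n_total   # 1000 / 1021
```

Rounded to four decimal places, `pinvar` equals the 0.9794 used for the F81 model with invariant sites.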
The optimal likelihood trees were found for each matrix using branch-and-bound while holding all trees. Both BS and JK analyses were performed on each matrix using 2500 pseudoreplicates, and RAS followed by TBR searches (one RAS search with one tree held,

10 RAS searches with one tree held per RAS search, and 10 RAS searches with up to 10 trees held per RAS search). Within RAxML, an optimal likelihood tree was searched for using 1000 independent searches starting from randomized parsimony trees with the GTRGAMMA model (the simplest model available for nucleotide characters in RAxML) and four discrete rate categories. BS analyses were conducted using 2500 pseudoreplicates with 100 searches per pseudoreplicate and using the "-f i" option, which "refine[s] the final BS tree under GAMMA and a more exhaustive algorithm" (Stamatakis, 2008, p. 9).

Table 4
Data matrix of binary characters that have been modified into nucleotide character states for likelihood analyses of the second contrived example.

1         ATGCAT TGCAGC ATGCATGCG
2         ATGCAT TGCAGC TTGCATGCG
3         TGCAGC TGCAGC AGGCATGCG
4         TGCAGC ATGCAT ATCCATGCG
5         TGCAGC ATGCAT ATGAATGCG
6         TGCAGC ATGCAT ATGCGTGCG
7         TGCAGC ATGCAT ATGCACGCG
8         TGCAGC ATGCAT ATGCATTCG
9         TGCAGC ATGCAT ATGCATGGG
10        TGCAGC ATGCAT ATGCATGCA
Wildcard  ?????? ?????? TGCAGCTGA

2.5. Data analysis

The BS and JK results for each data matrix were plotted onto the MRC using TreeGraph 2 ver. 2.0.45 197 (Stöver and Müller, 2010). Any additional clades resolved in a given BS or JK tree, which we found can occasionally occur, were not recorded. Only clades with ≥50% BS or JK support are reported unless otherwise stated. The raw data are posted as supplemental online data at http://www.biology.colostate.edu/research/. A minor confounding effect when comparing PAUP* and TNT results is that the JK deletion probability for PAUP* was set to 0.367879, whereas it was set to 0.37 in TNT.
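The deletion probability of 0.367879 agrees with e^-1 to six decimal places. e^-1 is the limiting fraction of characters expected to be absent from a bootstrap pseudoreplicate, since each of n characters is missed by all n draws with probability (1 - 1/n)^n. The text above does not state this rationale, so the following Python sketch is offered only as our reading of where the value comes from; the function name is hypothetical.

```python
import math


def expected_unsampled_fraction(n_chars: int) -> float:
    """Expected fraction of characters that never appear in one bootstrap
    pseudoreplicate built by drawing n_chars characters with replacement."""
    return (1.0 - 1.0 / n_chars) ** n_chars


# Limiting value as the number of characters grows.
limit = math.exp(-1)

# The matrix sizes 21 and 1021 are taken from Section 2.4; 100000 is an
# arbitrary large value used only to show convergence.
for n in (21, 1021, 100000):
    print(n, round(expected_unsampled_fraction(n), 6))
print("e^-1 =", round(limit, 6))
```

Under this reading, the difference between 0.367879 and the 0.37 used in TNT is a rounding difference, which fits the description of it as a minor confounding effect.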
For the four empirical data sets, the results from the two single-locus studies (Bailey et al., 2006; Richardson et al., 2006) were analyzed together and the results from the two supermatrix studies (McMahon and Sanderson, 2006; Thomson and Shaffer, 2010) were analyzed together. To eliminate any redundancy, only the results from the 223-terminal Thomson and Shaffer (2010) matrix were analyzed (though results from the 213-terminal matrix are included as supplemental online data). The rationale for making these pairwise groups was based on preliminary data inspection indicating consistency of results within each of the combined data sets, and on the fact that the single-locus and supermatrix studies are fundamentally different from each other with respect to numbers of parsimony-informative characters, percent missing and inapplicable entries for parsimony-informative characters, and the maximum percent of missing and inapplicable entries from single terminals (Table 3).

To sum and quantify the BS and JK support for otherwise unsupported clades, the averaged number of artifactual clades resolved (Simmons et al., 2010), a measure similar to the averaged overall success of resolution (Simmons and Webb, 2006), was applied. This measure scales BS and JK support to that conferred by one to four uncontradicted synapomorphies. Clades with 50–62% BS/JK support (less than that provided by one uncontradicted synapomorphy) were set to 0.2, 63–85% support to 0.4, 86–94% support to 0.6, 95–97% support to 0.8, and 98–100% support (equivalent to at least four uncontradicted synapomorphies) to 1.0.

Least-squares regression equations and determination coefficients were calculated in Microsoft Excel. For cases where the raw data were appropriate for regression (i.e., each value was an estimate of some parameter), those values were used directly. In cases where the data points that were collected resulted in a sum, the sums were used for regression, resulting in fewer data points being used.
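The scaling of BS/JK percentages described above can be written as a small lookup. The following Python sketch is illustrative only: the function names are hypothetical, it assumes support values are reported as whole percentages, and it assumes that clades below 50% contribute zero (the text only reports clades with at least 50% support).

```python
def scaled_support(percent: float) -> float:
    """Map a BS/JK percentage onto the 0.2-1.0 scale described in the text.

    Assumption: values below 50% score 0.0, since such clades are not reported.
    Assumption: percentages are integers, so the bins have no gaps.
    """
    if percent < 50:
        return 0.0
    if percent <= 62:
        return 0.2  # less than one uncontradicted synapomorphy
    if percent <= 85:
        return 0.4
    if percent <= 94:
        return 0.6
    if percent <= 97:
        return 0.8
    return 1.0      # at least four uncontradicted synapomorphies


def summed_scaled_support(percentages):
    """Sum the scaled support over a collection of unsupported clades."""
    return sum(scaled_support(p) for p in percentages)


# Hypothetical support values for three unsupported clades.
example = [99, 64, 50]
print([scaled_support(p) for p in example])
```

Binning in this way means that a clade at 98% and a clade at 100% contribute equally, so the summed measure is insensitive to small differences near the top of the scale.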
Because it is likely that the assumption of homoscedasticity is violated in studies of branch-support values given that variance is typically greater with lower values than with higher ones (Hedges, 1992), significance values were not determined for the regressions, but the slopes and determination coefficients (r²) were used as a general guide to the patterns and strength of correlations.

In order to test directly whether the SC or another approach is implemented in a particular program, a simple binary matrix was constructed and run using a small number n of resampling replicates and saving multiple trees per replicate (Jerrold I. Davis, pers. comm., 2010). Programs using a SC approach can only yield support values that are multiples of 1/n, while other approaches, such as FWR, can yield intermediate values. The specific matrix that we used may be accessed in the supplemental online data.

3. Results

3.1. Contrived examples

In the first contrived example, the tree is entirely unresolved in the SC, but clades (1, 2) and (1, 2, 3) are present in 99% and 98%, respectively, of the 100 most parsimonious trees found (Fig. 1). All programs reported ≥97% BS for clade (1, 2) and ≥95% BS for clade (1, 2, 3). Similarly high JK support was reported by PAUP* and PHYLIP for the same two clades, yet both clades received <50% JK support by TNT when more than one tree could be held per pseudoreplicate. Due to computational time constraints, not all PHYLIP or POY4 analyses were completed. This first example demonstrates how high the inferred support can be for unsupported clades.

In the second contrived example, the tree is entirely unresolved in the SC, but clades (1, 2) and (1, 2, 3) are present in 89% and 78%, respectively, of the nine most parsimonious trees when collapsing branches with a minimum possible optimized length of zero (Fig. 2). The same two clades are present in 98% and 96% of the 51 most parsimonious trees when such branches were not collapsed.
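A clade that is absent from the SC yet present in most optimal trees is exactly the situation in which the SC and FWR summaries of a pseudoreplicate diverge. The following Python sketch illustrates the distinction and the 1/n diagnostic described in Section 2.5. It is a toy illustration, not the implementation used by any of the programs tested: each tree is represented simply as a set of clades, and the function name and example data are hypothetical.

```python
from fractions import Fraction


def clade_support(pseudoreplicates, clade, method):
    """Resampling support for `clade` across pseudoreplicates.

    Each pseudoreplicate is a list of equally optimal trees, and each tree
    is a set of clades (frozensets of terminal names).
    method="SC":  a pseudoreplicate counts 1 only if every tree has the clade.
    method="FWR": a pseudoreplicate counts the fraction of its trees that
                  have the clade.
    """
    total = Fraction(0)
    for trees in pseudoreplicates:
        hits = sum(1 for tree in trees if clade in tree)
        if method == "SC":
            total += 1 if hits == len(trees) else 0
        elif method == "FWR":
            total += Fraction(hits, len(trees))
        else:
            raise ValueError("method must be 'SC' or 'FWR'")
    return total / len(pseudoreplicates)


# Hypothetical data: n = 4 pseudoreplicates with 1, 3, 2, and 1 optimal trees.
c = frozenset({"1", "2"})
with_clade, without_clade = {c}, set()
reps = [
    [with_clade],
    [with_clade, with_clade, without_clade],
    [with_clade, without_clade],
    [without_clade],
]

sc = clade_support(reps, c, "SC")
fwr = clade_support(reps, c, "FWR")
print("SC:", sc, "FWR:", fwr)
```

With these four pseudoreplicates the SC value is a multiple of 1/4, whereas the FWR value is not, which is the signature that the test matrix in Section 2.5 is designed to expose.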
Collapsing such branches always decreased both PAUP* and TNT BS and JK values for the unsupported clades when all trees were held in each pseudoreplicate, but did not do so when only a single tree was held. Consistent with the first contrived example, only TNT and POY4 JK values were <50% for both unsupported clades when multiple trees were held per pseudoreplicate and branches with a minimum possible optimized length of zero were collapsed. This second example demonstrates the importance of collapsing branches with a minimum possible length of zero.

In part A of the third contrived example, only clade (1B–6B + wildcard) is resolved in the SC, but clades (1B, 2B) and (1B, 2B, 3B) are present in 80% and 60%, respectively, of the most parsimonious trees found (Fig. 3). Despite being present in the SC and most JK trees, clade (1B–6B + wildcard) is unresolved in all ≥50% BS trees. Only SeaView BS and TNT and POY4 JK values are <50% for both unsupported clades when multiple trees were held per pseudoreplicate. Part B of the third contrived example, after elimination of terminals 1A–10A, is otherwise identical to part A with respect to resolution on the SC and MRC trees. Although the two unsupported clades might be considered independent of clade (1B–6B + wildcard) from part A, BS and JK values for these two unsupported clades generally decreased in part B, often substantially, and particularly when multiple trees were held per pseudoreplicate. This third example demonstrates that a localized wildcard terminal in the SC (as in part A) can behave as a global wildcard in some BS

Fig. 1. Results from the first contrived example plotted on the majority-rule consensus. Percentages in the majority-rule consensus are at the left edge of each branch. BS support is above each branch while JK support is below each branch. The maximum number of trees held in a given pseudoreplicate increases from left to right (1, 10, 100, 100,000). The BS/JK supports for MEGA, PHYLIP, and SeaView analyses are at their closest inferred positions. If a clade was not resolved in a given ≥50% BS or JK tree then it is reported as.

Fig. 2. Results from the second contrived example plotted on the majority-rule consensus for both parsimony and likelihood. Percentages in the parsimony majority-rule consensus are at the left edge of each branch. BS support is above each branch while JK support is below each branch. The maximum number of trees held in a given pseudoreplicate increases from left to right (parsimony: 1, 1000, all; PAUP* likelihood: 1, 10, 100). The BS/JK supports for DAMBE, MEGA, PHYLIP, RAxML, and SeaView analyses are at their closest inferred positions. If a clade was not resolved in a given ≥50% BS or JK tree then it is reported as. "col 0" refers to collapsing branches if the minimum optimized length is zero; said branches were not collapsed in the other PAUP* and TNT analyses reported here.

Fig. 3. Results from the third contrived example plotted on the majority-rule consensus. Percentages in the majority-rule consensus are at the left edge of each branch. BS support is above each branch while JK support is below each branch. The "a" analyses are those that incorporated terminals 1A–10A, whereas the "b" analyses did not. The maximum number of trees held in a given pseudoreplicate increases from left to right (1, 1000, all). The BS/JK supports for DAMBE, MEGA, PHYLIP, and SeaView analyses are at their closest inferred positions. If a clade was not resolved in a given ≥50% BS or JK tree then it is reported as. indicates that the wildcard terminal was resolved on the other side of the branch in the TNT BS/JK trees.

and JK pseudoreplicates, thereby making the artifacts more severe than may otherwise be expected (as in part B).

In the fourth contrived example, the tree is entirely unresolved in the SC, but clade (7, 8) is present in 99.9% of the most parsimonious trees (Fig. 4). The BS and JK values assigned to this unsupported clade differed dramatically among phylogenetic-inference programs and, in PAUP*, with the number of trees held. SeaView BS, TNT BS and JK, and PAUP* BS and JK, when only a single tree could be held per pseudoreplicate, are <50%. In contrast, MEGA BS, PHYLIP BS and JK, and PAUP* BS and JK, when multiple trees could be held per pseudoreplicate, are ≥95% and frequently 100% (presumably due to rounding error, at least in the PAUP* branch-and-bound searches). The BS and JK values also differ dramatically for the other clades shown in Fig. 4 depending on the program and the number of trees held. PAUP* BS and JK values are 100% for all of these clades [except clade (1–6) with 99% JK] and TNT BS and JK values are 54% when only a single most parsimonious tree was held per pseudoreplicate.
Yet these clades received <50% BS and JK values by PAUP* and TNT when all optimal trees were held per pseudoreplicate. Alternatively, PHYLIP BS and JK values are 64% for clades (1, 2) and (13, 14), yet <50% for the other clades, when only a single tree could be held per pseudoreplicate. SeaView provided different values still, with ≥60% BS for four of the clades when only a single tree could be held per pseudoreplicate. This fourth example demonstrates how minimal character overlap between sampled terminals in supermatrix analyses can create the artifacts and how the BS and JK values can vary dramatically among programs.

3.2. Empirical data

All of the following results are restricted to clades that were resolved in the MRC but not in the SC. By definition (Goloboff et al., 2003), all such clades are unsupported (or even contradicted if such clades conflict with those resolved in the SC). The number of terminals in each clade (in an unrooted sense, with the smallest possible clade size reported) vs. the averaged percentage in the MRC for those clades is presented in Fig. 5. A negative relationship (least-squares regression slope = −0.96x; r² = 0.21) was inferred in the combined supermatrix results, whereas a very weak positive

Fig. 4. Results from the fourth contrived example plotted on the BS or MRC for three selected clades. Percentages in the majority-rule consensus (including those <50%) are at the left edge of each branch. BS support is above each branch while JK support is below each branch. The maximum number of trees held in a given pseudoreplicate increases from left to right (1, 1000, all). The BS/JK supports for MEGA, PHYLIP, and SeaView analyses are at their closest inferred positions. If a clade was not resolved in a given ≥50% BS or JK tree then it is reported as.

Fig. 5. Number of terminals per clade vs. percentage in majority-rule consensus for that clade for all single-locus analyses (filled diamonds) and supermatrix analyses (open triangles). Error bars are ±95% confidence intervals.

relationship was inferred in the combined single-locus results (least-squares regression slope = 0.19x; r² = 0.02).

The number of terminals in each clade vs. the frequency of the FWR or undersampling-within-replicates artifact for those clades (defined by ≥50% support from two or more of the four tree-search strategies) is presented in Fig. 6a and b for the single-locus and supermatrix results, respectively. A slightly negative relationship was inferred in all single-locus (PAUP* BS = −0.009x, r² = 0.18; PAUP* JK = −0.009x, r² = 0.19; TNT BS = −0.006x, r² = 0.16; TNT JK = −0.008x, r² = 0.14) and the PAUP* supermatrix results (BS = −0.008x, r² = 0.06; JK = −0.012x, r² = 0.14), while a stronger negative relationship was inferred in the TNT supermatrix results (BS = −0.028x, r² = 0.62; JK = −0.031x, r² = 0.74).

The percentage in the MRC (binned at 5% intervals) vs. the frequency of the FWR or undersampling-within-replicates artifact for those clades is presented in Fig. 6c and d for the single-locus and supermatrix results, respectively.
A clear positive relationship was inferred in all supermatrix results (PAUP* BS = 0.009x, r² = 0.83; PAUP* JK = 0.009x, r² = 0.78; TNT BS = 0.012x, r² = 0.85; TNT JK = 0.012x, r² = 0.82), while a weaker positive relationship was inferred in the single-locus results (PAUP* BS = 0.001x, r² = 0.18; PAUP* JK = 0.003x, r² = 0.22; TNT BS = 0.001x, r² = 0.07; TNT JK = 0.003x, r² = 0.22).

The percentage in the MRC vs. the scaled support for those clades (as defined in Section 2.5, averaged across all four tree-search strategies) is presented in Fig. 6e and f for the single-locus and supermatrix results, respectively. These comparisons were limited to those clades resolved by both BS and JK within PAUP* for the PAUP* results and those resolved by both BS and JK within TNT for the TNT results. The comparisons were limited in this manner so as to better compare BS and JK results within each program without biasing the averages based on inclusion of clades with weak support in JK trees that are entirely absent in ≥50% BS trees. A clear positive relationship was inferred in all supermatrix results (PAUP* BS = 0.006x, r² = 0.94; PAUP* JK = 0.006x, r² = 0.91; TNT BS = 0.005x, r² = 0.93; TNT JK = 0.005x, r² = 0.93), while a weak positive relationship was inferred in the single-locus results (PAUP* BS = 0.003x, r² = 0.72; PAUP* JK = 0.005x, r² = 0.22; TNT BS = 0.0005x, r² = 0.002; TNT JK = 0.002x, r² = 0.04).

The tree-search strategy within each pseudoreplicate, with increasing thoroughness and number of trees held, vs. the number of occurrences of the FWR or undersampling-within-replicates artifact is presented in Fig. 7a and b for the single-locus and supermatrix results, respectively. Increasing the number of RAS/TBR searches performed from 1 to 10, and consequently the number of trees held from 1 to ≤10, increased the number of occurrences of the artifacts in all PAUP* results, and to a lesser degree in the TNT supermatrix results.
Further increasing the number of trees held generally decreased the number of occurrences of the artifacts in both PAUP* and TNT results. Unlike in the PAUP* results, the number of occurrences of the undersampling-within-replicates artifact in the

Fig. 6. Pairwise comparisons between three potentially correlated factors in the single-locus (a, c, and e) and supermatrix (b, d, and f) analyses. (a and b) number of terminals in clade vs. frequency of artifacts; (c and d) percentage in majority-rule consensus vs. frequency of artifacts; (e and f) percentage in majority-rule consensus vs. scaled support for artifacts. Symbols used (PAUP* bootstrap, PAUP* jackknife, TNT bootstrap, TNT jackknife) are explained in (c). Error bars are ±95% confidence intervals.

supermatrix results precipitously decreased for TNT when increasing the maximum number of trees held from 10 to 10,000 per RAS/TBR search. Essentially the same results were obtained when the tree-search strategy within each pseudoreplicate was compared to the number of occurrences of the FWR or undersampling-within-replicates artifact scaled to support for those clades (Fig. 7c and d).

4. Discussion

The contrived examples were deliberately created to show how extreme the FWR and undersampling-within-replicates artifacts could be. Yet the supermatrix studies demonstrated that the same artifacts could lead to 90+% BS/JK values for unsupported clades