This manual introduces you to these methods by taking as a starting point the results that can be obtained by applying them. In this way it is possible to guide you one step at a time, from a straightforward type of analysis to the more complicated ones.
CAFCA output has 6 parts.
You will get a guided tour along these items, using a simple example of a data matrix with 5 taxa, Aus, Bus, Cus, Dus, and Eus, and 8 characters. The data matrix is available in the examples folder on your distribution disk.
The first analysis will be very straightforward, using the program's defaults for the major parameters. The data matrix used is complex enough to introduce the several possibilities of CAFCA as to coding of characters in a binary matrix, but on the other hand also simple enough from the point of phylogenetic structure as not to allow many competing best cladograms. The same data matrix will be used in the next examples as well, to illustrate the effect of changing some of the parameters.
What follows is a discussion of the results obtained in this first primary analysis, and at the same time an explanation of the concepts, jargon, and peculiarities of CAFCA.
CAFCA - Mac version 1.3i (c) M.Z. 1987, 1995
Date : 1 AUG 1995    Time : 9H30M28S
CAFCA Parameter Settings
Type of analysis ...................... Primary
Cladon option ......................... 1: Partial Monothetic Sets (PMS)
Cladogram Selection Criterion ......... Minimum Length
Taxon on outgroup-node ................
Ancestral zero's in character ......... 4 5 10
Ordered characters .................... 10
Maximum Number of Cladograms .......... 2
Table 3.1: Header of CAFCA output of first example.
The header section (table 3.1) recapitulates the values of the major CAFCA parameters, as set before the analysis, either by you or by the program (defaults).
Cladograms for taxa are derived as general patterns of internested (hierarchical) groups of taxa, emerging from the combination of the particular (independent) pattern in each separate character.
Secondary analyses serve to resolve polytomies in cladograms resulting from a previous primary or biogeographical analysis or user-tree evaluation where the set of building blocks for cladograms contained insufficient information for complete dichotomous resolutions (see chapter 4).
Biogeographic analyses are run to explore the relations between the phylogeny of a group of taxa, or different phylogenies of different (unrelated) groups, and the geographical distribution of the taxa involved (see chapter 6).
This type of analysis can also be used to explore the historical relationships between parasites and hosts, by considering hosts as areas of endemism for parasites. In fact any co-evolutionary pattern can be studied for its historical implications by this type of analysis. You may even consider taxa as areas of endemism for genes (character state expressions). The phylogeny of the taxa is seen as the general pattern emerging from the separate phylogenies of independent genes (character carriers), just as a general pattern for the historical relations among areas of endemism emerges from the separate phylogenies of independent taxa. That's the reason why in CAFCA primary, secondary, and biogeographic analyses are identical as to the method employed.
User-tree evaluation takes place when you have entered a data matrix plus one or more cladograms that must be evaluated against this data matrix. The cladograms usually come from the literature, or are based on intuition, but are as a rule not directly derived from the data matrix itself (see also chapter 5).
User-trees need not be completely resolved. If they are unresolved (i.e., contain polytomies) they can be subjected to a secondary analysis after evaluation.
Another possible use of user-tree evaluation may result from running a primary analysis on a data matrix containing the 'better' characters that, however, do not give a completely resolved cladogram. After saving, this cladogram can be entered as a user-tree and evaluated against another data matrix containing the 'weaker' characters, and subsequently subjected to a secondary analysis on the basis of the 'weaker' characters.
User-tree evaluation is also applied in co-evolutionary studies. In those cases an independent estimate of the host phylogeny may be available. This host phylogeny is evaluated against the cladogram(s) for hosts found from the data matrix based on the parasite phylogeny and the distribution of parasites over hosts.
The cladon option refers to the way the building blocks for cladograms (clada) are defined. In the first example these clada are defined following the partial definition for monothetic sets.
In CAFCA you can choose from six different cladogram selection criteria. In the first example the default option, cladogram length, is chosen.
In a multi-state character a zero entry is interpreted as an indication of a (putative) ancestral state (see the paragraphs on 'assumptions regarding zero's in the data matrix').
In case a multi-state data matrix is used as input, the program will ask if these characters (none, some, all; if some then which) should be treated as ordered, that is, seen as an a priori polarised and ordered sequence of states. CAFCA can order multi-state characters only linearly in a sorted sequence (0 -> 1 -> 2 -> 3, etc...). Thus if you want the states ordered like 2 -> 1 -> 3 -> 0 you should first renumber them to 0, 1, 2, and 3, respectively.
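The renumbering described above can be done on the data before it is entered. The following is only a minimal Python sketch and not part of CAFCA; the function name `renumber_states` and the list-based layout of a character column are assumptions made for illustration.

```python
def renumber_states(column, desired_order):
    """Recode one multi-state character so that the intended
    series of states becomes 0 -> 1 -> 2 -> ...

    column        : list of state codes, one per taxon
    desired_order : the states in the intended order, first state first
    """
    # Map each original state to its rank in the intended series.
    recode = {state: rank for rank, state in enumerate(desired_order)}
    # Negative codes indicate missing values and are left untouched.
    return [s if s < 0 else recode[s] for s in column]


# Intended series 2 -> 1 -> 3 -> 0 is renumbered to 0 -> 1 -> 2 -> 3.
print(renumber_states([2, 1, 3, 0, 1, -1], [2, 1, 3, 0]))
```

After this recoding the character can be declared as ordered, because its states now follow the sorted sequence that CAFCA expects.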
Binary characters (0/1) are seen as characters with only one state (1) as CAFCA groups taxa as a result of presence (= 1) of states only (see also the assumptions regarding zero's in the data matrix). This state should best indicate a putative apomorphy if true phylogenetic results are required. If such putative polarisation is impossible or unwarranted you should transform the binary character (0/1) to a multi-state one (1/2), or you should click No for 'Ancestral state indicated by zero' in the CAFCA parameter dialog. In the latter case groups of terminal taxa will also be based on the distribution of zero's as these zero's may now represent an apomorphic state.
So, if you want to implement an a priori polarisation plus ordering of character states you should either apply binary matrices with (incomplete) additive binary coding, or enter a multi-state data matrix and enforce a linear ordering for all or some of the characters when the program prompts you to do so.
In the present example character 10 shows a linear ordering of its states in the binary image of the data matrix (table 3.2).
In the CAFCA parameter dialog box you can declare what the maximum number of cladograms (MNC) should be for which results will be retained in memory. In the header this parameter is represented by its declared value. Note that this number is different from the maximum number of cliques of components that CAFCA stores during its clique search (a built-in maximum of 5000).
Character states must be represented by digits (integers). Other symbols, like items from the alphabet, are not allowed. Missing values are allowed and must be indicated by a negative integer or a question mark.
In the analysis a binary representation of the data matrix is used for almost all computations (the cladogram optimisation algorithm uses a multi-state representation). This implies that if you define a multi-state matrix as input a copy of this matrix will be converted into a binary image (with the postfix ∆B added to its name).
The elements of the column partitioning vector (CPV) indicate how the columns (character states) of the binary data matrix should be taken together blockwise. Each block of columns corresponds to 1 character, i.e., a transformation series (= 1 column in the multi-state data matrix); each column in a block represents one character state.
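As an illustration of how a column partitioning vector cuts the columns of a binary matrix into blocks, here is a minimal Python sketch. It is not CAFCA code: the function name `split_into_blocks` is hypothetical, and the row and vector used are those of taxon Bus and the CPV in table 3.2.

```python
def split_into_blocks(row, cpv):
    """Cut one row of the binary data matrix into per-character blocks.

    row : list of 0/1 entries, one per character state (column)
    cpv : column partitioning vector, one block width per character
    """
    if sum(cpv) != len(row):
        raise ValueError("the CPV must account for every column")
    blocks, start = [], 0
    for width in cpv:
        blocks.append(row[start:start + width])
        start += width
    return blocks


cpv = [1, 2, 2, 2, 3, 3, 1, 1, 2, 3]
bus = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
blocks = split_into_blocks(bus, cpv)
print(len(blocks))   # one block per character
print(blocks[4])     # the three columns that make up character 5
```

Each block holds the columns of one transformation series, so the ten blocks correspond to the ten columns of the multi-state matrix.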
This procedure of treating the binary representation of multi-state characters as blocks of interdependent states avoids the errors that are introduced when each state of a multi-state character is treated as a separate nominal variable (Pimentel and Riggins, 1987).
If you define a binary data matrix as input, you will be prompted to provide a column partitioning vector to let the program know how the character states (columns) should be grouped together, successively, to derive a multi-state matrix. If you enter a multi-state data matrix as input, the program can derive a column partitioning vector for the binary image.
The data matrix of our first example (table 3.2) was copied from an ASCII file as a binary data matrix (PLANTB.INP from the Xmpls folder). There are 3 characters with only 1 state (# 1, 7, and 8), 4 characters with two states (# 2, 3, 4, and 9), and 3 characters with 3 states (# 5, 6, and 10).
Data Matrix (binary) : PLANT
(Columns represent character states)
       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    +------------------------------------------------------------
Aus |  1  1  0  0  1  0  0  0  0  0  0  0  1  0  1  0  1  0  0  0
Bus |  1  0  1  1  0  1  0  1  0  0  1  0  0  0  1  0  1  1  0  0
Cus |  1  0  1  1  0  0  1  0  1  0  0  1  0  0  1  0  1  1  1  0
Dus |  1  0  1  1  0  0  1  0  1  0  0  1  0  1  0  1  1  1  1  1
Eus |  1  0  1  1  0  0  1  0  0  1  0  1  0  1  0  1  0  1  1  1
Column Partitioning Vector : 1 2 2 2 3 3 1 1 2 3
Data Matrix (multi-state) : PLANT
(Columns represent characters)
       1  2  3  4  5  6  7  8  9 10
    +------------------------------
Aus |  1  1  2  0  0  3  0  1  2  0
Bus |  1  2  1  1  1  1  0  1  2  1
Cus |  1  2  1  2  2  2  0  1  2  2
Dus |  1  2  1  2  2  2  1  0  3  3
Eus |  1  2  1  2  3  2  1  0  1  3
Table 3.2: Data matrix for a cladistic character analysis.
Apparently there is a contradiction between some elements in the CPV (# 4, 5, 9) and the actual number of states in the characters. However, a zero in a multi-state character (like in # 4 and 5) is not treated as a separate state but as an indication of a (putative) ancestral condition (see assumptions regarding zero's). In character 9 state 3 in the multi-state matrix actually reflects a polytypism (see below) for state 1 and 2 in taxon 4 (columns 16 and 17 in binary image).
Note character 10 which in its binary image is additively coded (= a priori polarised and ordered). If you want to implement a priori polarisation plus ordering of character states in a particular character you should apply a binary matrix with (incomplete) additive coding in the block for that character, or enter a multi-state data matrix and enforce a linear ordering for all or some of the characters when the program asks you to do so.
In the case of characters with only one state (characters 1, 7, and 8) this state should best indicate a putative apomorphy if true phylogenetic results are required as CAFCA groups taxa as a result of presence of states only. If such putative polarisation is unwarranted you should transform the binary character (0/1) to a multi-state one (1/2), or you should click No for 'Ancestral state indicated by zero' in the CAFCA parameter dialog. In the latter case, groups of terminal taxa will also be based on the distribution of zero's as these zero's may now represent an apomorphic state. In all other cases characters are treated as unordered and unpolarized (characters 2 - 6, and 9), unless all states in a block are (incompletely) additive binary coded (character 10; columns 18, 19, 20 in binary image).
In the present example taxon 4 shows polytypism for character 9. No separate column for this state is present in the binary data matrix. The multi-state image of the binary matrix, however, shows a distinct code (3) for this polytypism that can be traced as such in the state change list (see page 45). Note, however, that CAFCA has no provision to deal with polytypism in internal nodes (hypothetical ancestral taxa) of the cladogram. It simply makes a most parsimonious assignment of a character state to an internal node, according to accelerated transformation (ACCTRAN).
For identical indicators of missing values for several taxa, say, three taxa all showing a -2, all possible combinations of these taxa with the taxa showing known states with value 2 will be used in the derivation of building blocks for cladograms. This is likewise true for taxa showing -1 with those showing 1 as a known state, those showing -3 with those showing 3, etc. This procedure implies that a data matrix with one column indicating identical missing values (e.g., -1) for all taxa will result in all cladograms possible given the number of taxa. The default limit is 6 taxa, implying 945 cladograms; you can indicate otherwise in the CAFCA parameter dialog. The maximum number possible is 12, although using it is quite absurd when you realise the number of cladograms (13,749,310,575) that is implied by this number: you may tie up CAFCA for years.
Thus if you know nothing, i.e., you have no data on your taxa, all possible outcomes are equally likely and will be presented (within limits).
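The two counts quoted above agree with the standard number of rooted, fully dichotomous cladograms for n labelled terminal taxa, the product 1 x 3 x 5 x ... x (2n - 3). The short Python sketch below only checks this arithmetic; it is not part of CAFCA, and the function name `count_cladograms` is hypothetical.

```python
def count_cladograms(n_taxa):
    """Number of rooted, fully dichotomous cladograms for n_taxa
    labelled terminal taxa: 1 * 3 * 5 * ... * (2 * n_taxa - 3)."""
    count = 1
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


print(count_cladograms(6))    # default limit
print(count_cladograms(12))   # maximum limit
```

The count grows so steeply that each extra taxon beyond the default multiplies the amount of work considerably.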
In ASCII files representing data matrices you can also use a question mark as an indicator of a missing value. When CAFCA imports these files the question marks are translated to -1.
Although using monothetic sets and variations thereof in the recognition of building blocks for cladograms, the group- and component compatibility method should not be confused with the so-called monothetic group method as discussed by Farris, Kluge and Mickevich (1982). In contrast to the latter method CAFCA does not depend on a priori specification of transformation series and polarities of characters.
Given a multi-state and a binary character for the taxa A to H, like for instance
A   1  1
B   1  0
C   2  1
D   3  0
E   3  0
F   2  1
G   1  1
H   2  1
the unordered representation in the binary data matrix will be the following blocks of character states:
    11 12 13 21
A    1  0  0  1
B    1  0  0  0
C    0  1  0  1
D    0  0  1  0
E    0  0  1  0
F    0  1  0  1
G    1  0  0  1
H    0  1  0  1
from which, assuming that zero in the binary character implies an ancestral condition, the following list of clada is derived:
ABG, CFH, DE, and ACFGH;
from character states 11, 12, 13, and 21, respectively.
If the apomorphy decision for 1 or zero in binary character #2 is still undecided, the list of clada is supplemented by set {BDE} based on character state 20.
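The derivation of these clada can be sketched in a few lines of Python. This is an illustration only, not CAFCA code: the function name `clada_from_character` and the dictionary layout are assumptions, and missing values are simply skipped here rather than combined with known states.

```python
def clada_from_character(states, char_no, zero_is_ancestral=True):
    """Group taxa by the state they show for one character.

    states  : dict mapping taxon name -> state code
    char_no : character number, used as prefix of the state label
    Returns a dict mapping state label (e.g. '12') -> string of taxa.
    """
    clada = {}
    for taxon in sorted(states):
        state = states[taxon]
        if state < 0:
            continue  # missing value: ignored in this sketch
        if state == 0 and zero_is_ancestral:
            continue  # a zero indicates the ancestral condition
        label = f"{char_no}{state}"
        clada[label] = clada.get(label, "") + taxon
    return clada


multi = dict(zip("ABCDEFGH", [1, 1, 2, 3, 3, 2, 1, 2]))
binary = dict(zip("ABCDEFGH", [1, 0, 1, 0, 0, 1, 1, 1]))

print(clada_from_character(multi, 1))
print(clada_from_character(binary, 2))
print(clada_from_character(binary, 2, zero_is_ancestral=False))
```

The last call shows the undecided case, in which state 20 contributes the extra set BDE.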
Clada defined as strict monothetic sets reflect the first signs of doubt as to the homology conjectures implied by partial monothetic sets. Strict sets say, as it were, that if the initial conjectures of homology are doubtful, then a first hint of how these homologies may be broken down is given by the distribution (over taxa) of other states from other characters (congruence).
Given binary characters for the taxa A to H, like for instance
    1  2  3  4   (characters)
A   1  1  1  0
B   1  0  1  0
C   1  1  0  0
D   0  0  1  1
E   1  0  0  0
F   0  1  0  0
G   1  1  0  1
H   0  1  0  1
the following list of clada is generated under option 1 (PMS):
ABD, DGH, ABCEG, and ACFGH;
from character 3, 4, 1, and 2, respectively (zero's assumed to indicate the ancestral condition).
Using option 2 (SMS) the following clada are generated as well:
AB, GH, and ACG.
AB results from the interaction between characters 1 and 3 (if ABD is not based on a character state that is homologous over taxa, then maybe AB is), GH from characters 2 and 4, and ACG from characters 1 and 2. In this way we are still limiting the number of homoplasies to account for if character state distributions are not fully congruent. For instance, the set GH is not broken down into G and H separately unless there is evidence supplied by other characters that we should do so.
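The three extra clada of this example can be reproduced by intersecting the partial monothetic sets pairwise. The Python sketch below rests on that assumption only: it takes overlaps of two characters that contain at least two taxa, which fits this example but may be narrower than the definition of strict monothetic sets that CAFCA actually applies. The function name `pairwise_overlaps` is hypothetical.

```python
from itertools import combinations

# Partial monothetic sets of the example, keyed by character number.
pms = {1: set("ABCEG"), 2: set("ACFGH"), 3: set("ABD"), 4: set("DGH")}


def pairwise_overlaps(sets_by_char, min_size=2):
    """Return {taxa: (char_a, char_b)} for every pairwise overlap that
    holds at least min_size taxa and is not already one of the sets."""
    known = {frozenset(s) for s in sets_by_char.values()}
    found = {}
    for c1, c2 in combinations(sorted(sets_by_char), 2):
        common = frozenset(sets_by_char[c1] & sets_by_char[c2])
        if len(common) >= min_size and common not in known:
            found["".join(sorted(common))] = (c1, c2)
    return found


print(pairwise_overlaps(pms))
```

Overlaps of a single taxon (such as G for characters 1 and 4) are left out because a single terminal taxon is no grouping.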
As we should not burden our analysis with hypotheses of homoplasy beyond necessity (Hennig's auxiliary principle), we usually take partial monothetic sets for clada (option 1) as first approximations in our attempts to achieve a fully resolved and parsimonious explanation of our data in terms of a cladogram. This approximation can be made better, if need be, by using strict monothetic sets (option 2) in another attempt to achieve fully resolved most parsimonious cladograms.
Wilkinson (1995) describes a strategy, safe taxonomic reduction, to cope with abundant missing entries in a data matrix. Through this strategy only taxa that can have no effect upon the inferred relationships of other taxa included in the analysis are excluded prior to analysis. According to Wilkinson a minimum requirement for the inclusion of any terminal taxon to alter relationships among the other terminal taxa is that it must have unique combinations of phylogenetically informative characters. Taxa that have the same combination of character states, so-called taxonomic equivalents, can be safely removed from the data matrix and are potential candidates for elimination prior to analysis. As described above, unique combinations of character states can be found through the application of the definition of (strict) monothetic sets.
We could extend the notion of what is considered a minimal requirement for terminal taxa to apply to internal nodes of a cladogram as well. Thus any component that is a strict monothetic set will affect relationships among the other components. As a corollary one may contemplate how to judge and what to do with components from a MPC that appear to have identical sets of character state combinations.
Instead of breaking down clada into subsets as indicated by overlapping character states as SMS does, this option finds new clada by iteratively joining the distributions of pairwise overlapping character states. For instance the binary data used in the example above (option 2) to derive strict monothetic sets result in the following sets after the first iteration.
{ABCEFGH} 1 + 2
{ABCDEG} 1 + 3
{ABCDEGH} 1 + 4
{ABCDFGH} 2 + 3
{ACDFGH} 2 + 4
{ABDGH} 3 + 4
In the second iteration these joint distributions are combined among each other as well as with the distributions of the original character states, e.g., {1 + 2} will be combined with {3} and with {4}, as well as with {3 + 4}, etc. The iterations continue until no new combinations of sets are found.
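The iterative joining described above can be sketched as a repeated union of overlapping sets until nothing new turns up. This Python fragment is an illustration under that reading of the text, not CAFCA's own routine; the function names `join_overlapping` and `iterate_joins` are hypothetical.

```python
from itertools import combinations


def join_overlapping(sets):
    """One iteration: the unions of all pairs of sets sharing a taxon."""
    return {a | b for a, b in combinations(sets, 2) if a & b}


def iterate_joins(sets):
    """Repeat the joining until no new set is found."""
    known = set(sets)
    while True:
        new = join_overlapping(known) - known
        if not new:
            return known
        known |= new


# Distributions of characters 1 to 4 from the example.
start = {frozenset("ABCEG"), frozenset("ACFGH"),
         frozenset("ABD"), frozenset("DGH")}

first = join_overlapping(start) - start
print(sorted("".join(sorted(s)) for s in first))
print(len(iterate_joins(start)))
```

The first printed list holds the six sets of the first iteration; the closure additionally contains, among others, the set of all eight taxa.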
This option # 3 can also be applied to strict monothetic sets. If you choose option 2 (SMS) right from the beginning, you will be prompted by CAFCA whether you want to add the complementary codes to the SMS's as well.
Given a multi-state character for the taxa A to H, like for instance
A   1
B   1
C   2
D   3
E   3
F   2
G   1
H   2
the unordered representation in the binary data matrix will be the following block of character states:
    11 12 13
A    1  0  0
B    1  0  0
C    0  1  0
D    0  0  1
E    0  0  1
F    0  1  0
G    1  0  0
H    0  1  0
from which the following permutations of additive binary codings (transformation series) are derived
    1 2 3   1 3 2   2 1 3   2 3 1   3 2 1   3 1 2   (series)
    1 2 3   1 2 3   1 2 3   1 2 3   1 2 3   1 2 3   (states)
A   1 0 0   1 0 0   1 1 0   1 1 1   1 1 1   1 0 1
B   1 0 0   1 0 0   1 1 0   1 1 1   1 1 1   1 0 1
C   1 1 0   1 1 1   0 1 0   0 1 0   0 1 1   1 1 1
D   1 1 1   1 0 1   1 1 1   0 1 1   0 0 1   0 0 1
E   1 1 1   1 0 1   1 1 1   0 1 1   0 0 1   0 0 1
F   1 1 0   1 1 1   0 1 0   0 1 0   0 1 1   1 1 1
G   1 0 0   1 0 0   1 1 0   1 1 1   1 1 1   1 0 1
H   1 1 0   1 1 1   0 1 0   0 1 0   0 1 1   1 1 1
as well as the codes for the branched varieties
These binary codes are used to derive partial monothetic sets, as shown under option 1 above, in addition to the sets already obtained from the original binary codes of the character states in the data matrix.
Thus the list already obtained in option 1, ABG, CFH, and DE, is supplemented with the clada ABCDEFGH, CDEFH, AB, ABDEG, and ABCFGH.
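For the linear series only, the additive codes and the sets they imply can be generated as in the following Python sketch. It is not CAFCA code and does not cover the branched varieties; the function name `additive_sets` is hypothetical. In each series the column of a state holds a 1 for every taxon showing that state or a state further along the series.

```python
from itertools import permutations

states = dict(zip("ABCDEFGH", [1, 1, 2, 3, 3, 2, 1, 2]))


def additive_sets(states):
    """Taxon sets implied by additive binary coding of every linear
    series (permutation) of the states of one character."""
    found = set()
    values = sorted(set(states.values()))
    for series in permutations(values):
        for pos in range(len(series)):
            # States from this position onward all carry a 1 in the
            # column that belongs to series[pos].
            derived = set(series[pos:])
            taxa = "".join(t for t in sorted(states) if states[t] in derived)
            found.add(taxa)
    return found


original = {"ABG", "CFH", "DE"}   # sets from the unordered coding
print(sorted(additive_sets(states) - original))
```

The printed list contains only the sets that the linear series add to those already obtained from the unordered coding.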
A   0
B   1
C   1
D   1
E   0
F   0
G   1
H   0
if this representation contains six 1's or fewer (6 is the default; you can indicate more in the CAFCA parameter dialog). Make all groupings of taxa (duos, trios, quartets, etc.) based on the distribution of the state-present indication (1), e.g.,
BC, BD, BG, CD, CG, DG, BCD, BCG, BDG, CDG, and BCDG
Thus, for instance, the groups BC, D, and BCD in (BC)D must have independent (= not identical) supporting character states for the three-taxon statement to be considered valid (all characters in the data matrix are used to this end). But the same must be true for BCD, G, and BCDG in (BCD)G, etc...
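The list of groupings given above for the taxa B, C, D, and G can be enumerated with a few lines of Python. This is only a sketch of the enumeration step, not of the subsequent test for independent support, and not CAFCA code; the function name `groupings` and the parameter `max_ones` are hypothetical.

```python
from itertools import combinations


def groupings(character, max_ones=6):
    """All groups of two or more taxa that show the state (1).

    character : dict mapping taxon name -> 0 or 1
    max_ones  : no groupings are made if more taxa than this show a 1
    """
    present = [t for t in sorted(character) if character[t] == 1]
    if len(present) > max_ones:
        return []
    return ["".join(group)
            for size in range(2, len(present) + 1)
            for group in combinations(present, size)]


char = dict(zip("ABCDEFGH", [0, 1, 1, 1, 0, 0, 1, 0]))
print(groupings(char))
```

With four taxa showing the state there are 6 + 4 + 1 = 11 groupings; with six taxa there would already be 57, which is one reason for keeping the limit low.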
By generating three-taxon statements we can, within practical limits, explore the situation where, according to Wilkinson (1991), parsimony analysis will not be misled if "... for any pair of sister taxa A and B there is more reliable evidence of their membership in a series of nested holophyletic groups to the exclusion of any unrelated taxon C, than there is misleading counterevidence for the inclusion of either A or B in an alternative set of nested groups to the exclusion of the other."
Nelson and Platnick (1991) suggest another implementation of three-taxon statements. From a list of taxa sharing the same state of a character they consider only the pairs of taxa, and disregard groupings of higher order. To form three-taxon statements these pairs are united with all other taxa not sharing this state (i.e., having a zero), one at a time. This is repeated for each separate character. In this way a new data matrix is built, composed of the three-taxon statements implied by the characters in the original data matrix. Note that my implementation of three-taxon statements does not replace the original data matrix but only serves to provide additional building blocks for cladograms. Nelson and Platnick's definition of three-taxon statements is also treated in this way by CAFCA. Only if you export the data matrix in NEXUS, PAUP, or HENNIG86 format will the N&P three-taxon codes replace the original data matrix. If you opt for three-taxon statements in the CAFCA parameters dialog you will be offered a choice between Nelson and Platnick's implementation and CAFCA's.
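The pairing step of Nelson and Platnick's implementation, as described in this paragraph, can be sketched as follows. The Python fragment is an illustration of that description only, applied to a single binary character; it is neither their code nor CAFCA's, it ignores missing values, and the function name `np_three_taxon_statements` is hypothetical.

```python
from itertools import combinations


def np_three_taxon_statements(character):
    """Unite every pair of taxa showing the state (1) with each taxon
    not showing it (0), one at a time.

    character : dict mapping taxon name -> 0 or 1
    Returns statements written as '(XY)Z'.
    """
    present = [t for t in sorted(character) if character[t] == 1]
    absent = [t for t in sorted(character) if character[t] == 0]
    return [f"({a}{b}){c}"
            for a, b in combinations(present, 2)
            for c in absent]


char = dict(zip("ABCDEFGH", [0, 1, 1, 1, 0, 0, 1, 0]))
statements = np_three_taxon_statements(char)
print(len(statements))
print(statements[:4])
```

For this character, with four taxa showing the state and four showing a zero, the six pairs each combine with four other taxa, giving 24 statements.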