PHYLIP DNA parsimony (dnapars) GUI Free Download - PHYLIP 3.695
In the meantime, I may not be able to devote time to searching for new programs, so their authors are begged to please! That form will be found at the "Submitting" link below. If you are upset that your program is not included, but it's too much trouble for you to fill out the submission form, then I will not listen to you.
If anyone else wants to help with this, let me know. Owing to past NSF support of these pages, I am required to note that any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
If you make an error in the format of the input file, the programs can sometimes detect that they have been fed an illegal character or illegal numerical value and issue an error message such as BAD CHARACTER STATE:, often printing out the bad value, and sometimes the number of the species and character in which it occurred.
The program will then stop shortly after. One of the things which can lead to a bad value is the omission of something earlier in the file, or the insertion of something superfluous, which cause the reading of the file to get out of synchronization.
The program then starts reading things it didn't expect, and concludes that they are in error. So if you see this error message, you may also want to look for the earlier problem that may have led to the program becoming confused about what it is reading.

Some options are described below, but you should also read the documentation for the groups of the programs and for the individual programs.

The Menu

The menu is straightforward. It typically looks like this (this one is for Dnapars):

DNA parsimony algorithm, version 3.

  U                 Search for best tree?  Yes
  S                        Search option?  More thorough search
  V              Number of trees to save?
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species 1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  N           Use Transversion parsimony?  No, count all steps
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes

  Y to accept these or type the letter for one to change

If you want to accept the default settings (they are shown in the above case) you can simply type Y followed by pressing the Enter key. If you want to change any of the options, you should type the letter shown to the left of its entry in the menu.
For example, to set a threshold type T. Lower-case letters will also work. For many of the options the program will ask for supplementary information, such as the value of the threshold. Note the Terminal type entry, which you will find on all menus. It allows you to specify which type of terminal your screen is.
Choosing zero (0) toggles among the three terminal-type settings in cyclical order, changing each time the 0 option is chosen. If one of them is right for your terminal, the screen will be cleared before the menu is displayed.
If none works, the none option should probably be chosen. The programs should start with a terminal option appropriate for your computer, but if they do not, you can change the terminal type manually. This is particularly important in program Retree where a tree is displayed on the screen - if the terminal type is set to the wrong value, the tree can look very strange. The other numbered options control which information the program will display on your screen or on the output files.
The option to Print indications of progress of run will show information such as the names of the species as they are successively added to the tree, and the progress of rearrangements. You will usually want to see these as reassurance that the program is running and to help you estimate how long it will take.
But if you are running the program "in background" (as can be done on multitasking and multiuser systems), and do not have the program running in its own window, you may want to turn this option off so that it does not disturb your use of the computer while the program is running. Note also menu option 3, "Print out tree", which can be turned off.
This can be useful when you are running many data sets, and will be using the resulting trees from the output tree file. It may be helpful to turn off the printing out of the trees in that case, particularly if those files would be too big.

The Output File

Most of the programs write their output onto a file called (usually) outfile, and a representation of the trees found onto a file called outtree. The exact contents of the output file vary from program to program and also depend on which menu options you have selected.
For many programs, if you select all possible output information, the output will consist of (1) the name of the program and its version number, (2) some of the input information printed out, and (3) a series of phylogenies, some with associated information indicating how much change there was in each character or on each part of the tree.
The numbers at the forks are arbitrary and are used if present merely to identify the forks. For many of the programs the tree produced is unrooted.
Rooted and unrooted trees are printed in nearly the same form, but the unrooted ones are accompanied by the warning message: remember: this is an unrooted tree! Mathematicians still call an unrooted tree a tree, though some systematists unfortunately use the term "network" for an unrooted tree. This conflicts with standard mathematical usage, which reserves the name "network" for a completely different kind of graph.
The root of this tree could be anywhere, say on the line leading immediately to Mouse. It is important also to realize that the lengths of the segments of the printed tree may not be significant: some may actually represent branches of zero length, in the sense that there is no evidence that those branches are nonzero in length.
Some of the diagrams of trees attempt to print branches approximately proportional to estimated branch lengths, while in others the lengths are purely conventional and are presented just to make the topology visible.
You will have to look closely at the documentation that accompanies each program to see what it presents and what is known about the lengths of the branches on the tree. The above tree attempts to represent branch lengths approximately in the diagram. But even in those cases, some of the smaller branches are likely to be artificially lengthened to make the tree topology clearer. When a tree has branch lengths, it will be accompanied by a table showing for each branch the numbers or names of the nodes at each end of the branch, and the length of that branch.
For the first tree shown above, the corresponding table is:

  Between        And            Length      Approx. Confidence Limits
     1          Bovine          0.
Similar tables exist in distance matrix and likelihood programs, as well as in the parsimony programs Dnapars and Pars. Some of the parsimony programs in the package can print out a table of the number of steps that different characters or sites require on the tree.
This table may not be obvious at first. Each row covers ten sites and each column gives the final digit of the site number. Thus site 23 is column "3" of row "20" and has 1 step in this case. There are many other kinds of information that can appear in the output file. They vary from program to program, and we leave their description to the documentation files for the specific programs.

The Tree File

In output from most programs, a representation of the tree is also written into the tree file outtree.
The tree is specified by nested pairs of parentheses, enclosing names and separated by commas. We will describe how this works below. Trailing blanks in the name may be omitted. The pattern of the parentheses indicates the pattern of the tree by having each pair of parentheses enclose all the members of a monophyletic group. The tree file could look like this:

  ((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));

In this tree the first fork separates the lineage leading to Mouse and Bovine from the lineage leading to the rest.
Within the latter group there is a fork separating Gibbon from the rest, and so on. The entire tree is enclosed in an outermost pair of parentheses. The tree ends with a semicolon.

In some programs such as Dnaml, Fitch, and Contml, the tree will be unrooted. An unrooted tree is written with a three-way split at its bottommost fork.
The single three-way split corresponds to one of the interior nodes of the unrooted tree (it can be any interior node of the tree). The remaining forks are encountered as you move out from that first node.
Some of the newer programs are able to tolerate these other forks being multifurcations (multi-way splits). You should check the documentation files for the particular programs you are using to see which of these forms the user tree is expected to be in. Note that many of the programs that actually estimate an unrooted tree (such as Dnapars) produce trees in the treefile in rooted form!
This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary. We are working toward having all programs be able to read all trees, whether rooted or unrooted, multifurcating or bifurcating, and having them do the right thing with them.
But this is a long-term goal and it is not yet achieved.

For programs that infer branch lengths, these are given in the trees in the tree file as real numbers following a colon, and placed immediately after the group descended from that branch. Here is a typical tree with branch lengths: cat ...

These representations of trees are a subset of the standard adopted on 24 June at the annual meetings of the Society for the Study of Evolution by an informal committee (its final session in Newick's lobster restaurant - hence its name, the Newick standard) consisting of Wayne Maddison (author of MacClade), David Swofford (PAUP), F.
Day, and me. This standard is a generalization of PHYLIP's format, itself based on a well-known representation of trees in terms of parenthesis patterns which is due to the famous mathematician Arthur Cayley, and which has been around for over a century. The standard is now employed by most phylogeny computer programs but unfortunately has yet to be described in a formal published description.
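The nested-parenthesis notation described above can be read with a short recursive routine. What follows is only a minimal sketch in Python, not code from the package: the function names parse_newick, leaf_names and groups are made up for illustration, blanks are simply discarded, and quoted names and comments (which the full standard allows) are not handled.

```python
def parse_newick(text):
    """Parse a tree string into nested (name, length, children) tuples.

    name is "" for an unnamed fork; length is None when no branch length is given.
    """
    s = "".join(text.split())          # discard blanks and line breaks
    if not s.endswith(";"):
        raise ValueError("tree must end with a semicolon")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                   # consume "("
            children.append(parse_node())
            while s[pos] == ",":
                pos += 1               # consume ","
                children.append(parse_node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                   # consume ")"
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos]
        length = None
        if s[pos] == ":":              # a branch length follows the colon
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return (name, length, children)

    root = parse_node()
    if s[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_names(node):
    """Species names below a node, in the order they are written."""
    name, _length, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


def groups(node):
    """The set of species enclosed by each pair of parentheses."""
    _name, _length, children = node
    found = []
    if children:
        found.append(frozenset(leaf_names(node)))
        for child in children:
            found.extend(groups(child))
    return found


tree = parse_newick("((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));")
print(leaf_names(tree))
print(frozenset({"Chimp", "Human"}) in groups(tree))
```

Each entry returned by groups corresponds to one pair of parentheses, which is the sense in which the parentheses enclose the members of a monophyletic group.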
Options are selected in the menu.

Common options in the menu

A number of the options from the menu, the U (User tree), G (Global), J (Jumble), O (Outgroup), W (Weights), T (Threshold), M (multiple data sets), and the tree output options, are used so widely that it is best to discuss them in this document.

The U (User tree) option. This option toggles between the default setting, which allows the program to search for the best tree, and the User tree setting, which reads a tree or trees ("user trees") from the input tree file and evaluates them.
The input tree file's default name is intree. In many cases the programs will also tolerate having the trees be preceded by a line giving the number of trees:

  3
  ((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));
  ((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));
  ((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));

An initial line with the number of trees was formerly required, but this now can be omitted.
Some programs require rooted trees, some unrooted trees, and some can handle multifurcating trees. You should read the documentation for the particular program to find out which it requires. Program Retree can be used to convert trees among these forms (on saving a tree from Retree, you are asked whether you want it to be rooted or unrooted). In using the user tree option, check the pattern of parentheses carefully. The programs do not always detect whether the tree makes sense, and if it does not there will probably be a crash (hopefully, but not inevitably, with an error message indicating the nature of the problem).
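Since the programs do not always detect a malformed user tree, it can help to check the parentheses before running them. The following is a minimal sketch in Python, assuming the tree file is plain text with trees terminated by semicolons; split_trees and check_user_tree are hypothetical helper names, and the check covers only parenthesis balance and the final semicolon, not whether the species names match the data.

```python
def split_trees(file_text):
    """Split the text of a tree file into tree strings.

    A leading line holding only the number of trees is skipped if present.
    """
    lines = file_text.strip().splitlines()
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    parts = " ".join(lines).split(";")
    trees = [p.strip() + ";" for p in parts[:-1] if p.strip()]
    leftover = parts[-1].strip()
    if leftover:                       # text after the last semicolon
        trees.append(leftover)         # kept without ";" so the check flags it
    return trees


def check_user_tree(tree):
    """Return a list of problems found in one tree string (empty if none found)."""
    problems = []
    text = tree.strip()
    if not text.endswith(";"):
        problems.append("tree does not end with a semicolon")
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at character {i + 1}")
                depth = 0
    if depth > 0:
        problems.append(f"{depth} '(' never closed")
    return problems


intree = """3
((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));
((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));
((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));
"""
for tree in split_trees(intree):
    print(check_user_tree(tree) or "no problems found")
```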
Trees written out by programs are typically in the proper form.

The G (Global) option. The programs which construct trees (with some exceptions, Neighbor among them) rearrange the tree in an attempt to improve it. In most of these programs the rearrangements are automatically global, which in this case means that subtrees will be removed from the tree and put back on in all possible ways so as to have a better chance of finding a better tree. Since this can be time consuming (it roughly triples the time taken for a run) it is left as an option in some of the programs, specifically Contml, Fitch, Dnaml and Proml.
In these programs the G menu option toggles between the default of local rearrangement and global rearrangement. The rearrangements are explained more below.

The J (Jumble) option. In most of the tree construction programs the search depends on the order in which the species are input. In these programs the J option enables you to tell the program to use a random number generator to choose the input order of species.
This option is toggled on and off by selecting option J in the menu. The program will then prompt you for a "seed" for the random number generator. Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the programs one can look for other, and better trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.
The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run).

Some people have asked what are good values of the random number seed. The random number seed is used to start a process of choosing "random" (actually pseudorandom) numbers, which behave as if they were unpredictably randomly chosen between 0 and 2^32-1 (which is 4,294,967,295). The numbers that follow a given seed bear no obvious relation to it, so no odd seed is better than any other. However if you re-use a random number seed, the sequence of random numbers that result will be the same as before, resulting in exactly the same series of choices, which may not be what you want.
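The behaviour of the J option described above (an odd seed, several random addition orders, and a single record of the best trees over all runs) can be sketched as follows. This is only an illustration in Python: it uses Python's own random number generator rather than the one in the programs, and build_tree and score are hypothetical callbacks standing in for a program's search and its criterion, with lower scores assumed to be better.

```python
import random


def jumble_search(species, seed, times, build_tree, score):
    """Try `times` random addition orders and keep the trees tied for best overall."""
    if seed % 2 == 0:
        raise ValueError("the random number seed must be odd")
    rng = random.Random(seed)          # not the generator the programs use
    best_score, best_trees = None, []
    for _ in range(times):
        order = list(species)
        rng.shuffle(order)             # a new random order of addition
        tree = build_tree(order)
        s = score(tree)
        if best_score is None or s < best_score:
            best_score, best_trees = s, [tree]      # strictly better: start over
        elif s == best_score and tree not in best_trees:
            best_trees.append(tree)                 # tied with the best so far
    return best_score, best_trees
```

Re-running with the same seed repeats exactly the same orders, which is why a different seed is needed to explore other orders.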
The O (Outgroup) option. This specifies which species is to have the root of the tree be on the line leading to it. For example, if the outgroup is a species "Mouse" then the root of the tree will be placed in the middle of the branch which is connected to this species, with Mouse branching off on one side of the root and the lineage leading to the rest of the tree on the other. This option is toggled on and off by choosing O in the menu (the alphabetic character O, not the digit 0).
When it is on, the program will then prompt for the number of the outgroup (the species being taken in the numerical order that they occur in the input file). Responding by typing 6 and then an Enter character indicates that the sixth species in the data (the 6th in the first set of data if there are multiple data sets) is taken as the outgroup.
Outgroup-rooting will not be attempted if the data have already established a root for the tree from some other consideration, and may not be if it is a user-defined tree, despite your invoking the option. Thus programs such as Dollop that produce only rooted trees do not allow the Outgroup option. It is also not available in Kitsch, Dnamlk, Promlk or Clique.
When it is used, the tree as printed out is still listed as being an unrooted tree, though the outgroup is connected to the bottommost node so that it is easy to visually convert the tree into rooted form.

The T (Threshold) option. This sets a threshold for the parsimony programs such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps.
The default is a threshold so high that it will never be surpassed (in which case the steps will simply be counted). The T menu option toggles on and off asking the user to supply a threshold. The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described in one of my papers (Felsenstein, b).
When the T option is in force, the program will prompt for the numerical threshold value. This will be a positive real number greater than 1. In programs Dollop, Dolmove, and Dolpenny the threshold should never be 0.
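The effect of the threshold on the step count can be shown in a few lines. This is a minimal Python sketch, assuming the number of steps each character requires on a given tree is already known; threshold_steps is a name made up for the illustration.

```python
def threshold_steps(steps_per_character, threshold=None):
    """Total the steps on a tree, counting no character for more than `threshold`."""
    if threshold is None:              # ordinary parsimony: steps are simply counted
        return sum(steps_per_character)
    return sum(min(steps, threshold) for steps in steps_per_character)


steps = [1, 1, 2, 5, 3]                # steps required by five characters on one tree
print(threshold_steps(steps))          # ordinary parsimony
print(threshold_steps(steps, 2))       # characters needing more than 2 steps count as 2
```

With a low threshold a character that changes many times contributes no more than one that changes a few times, which is how rapidly evolving characters are de-weighted.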
The T option is an important and underutilized one: it is, for example, the only way in this package (except for program Dnacomp) to do a compatibility analysis when there are missing data. It is a method of de-weighting characters that evolve rapidly. I wish more people were aware of its properties.

The M (Multiple data sets) option. In menu programs there is an M menu option which allows one to toggle on the multiple data sets option.
The program will ask you how many data sets it should expect. The data sets have the same format as the first data set. Using the program Seqboot one can take any DNA, protein, restriction sites, gene frequency or binary character data set and make multiple data sets by bootstrapping.
Trees can be produced for all of these using the M option. They will be written on the tree output file if that option is left in force. Then the program Consense can be used with that tree file as its input file. The result is a majority rule consensus tree which can be used to make confidence intervals. The present version of the package allows, with the use of Seqboot and Consense and the M option, bootstrapping of many of the methods in the package.
Programs Dnaml, Dnapars and Pars can also take multiple weights instead of multiple data sets. They can then do bootstrapping by reading in one data set, together with a file of weights that show how the characters or sites are reweighted in each bootstrap sample.
Thus a site that is omitted in a bootstrap sample has effectively been given weight 0, while a site that has been duplicated has effectively been given weight 2. Seqboot has a menu selection to produce the file of weights information automatically, instead of producing a file of multiple data sets. It can be renamed and used as the input weights file.
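The correspondence between a bootstrap sample and a set of weights can be made concrete. The sketch below is in Python and is not how Seqboot is implemented; bootstrap_weights and apply_weights are hypothetical names, and Python's random module stands in for the package's generator.

```python
import random
from collections import Counter


def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement and return how often each was drawn."""
    drawn = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [drawn[site] for site in range(n_sites)]


def apply_weights(sequence, weights):
    """Build the resampled sequence that a weight vector stands for."""
    return "".join(base * w for base, w in zip(sequence, weights))


rng = random.Random(12345)
weights = bootstrap_weights(10, rng)
print(weights)                                   # sums to 10; 0 means the site was omitted
print(apply_weights("ACGTACGTAC", weights))
print(apply_weights("ACGT", [0, 2, 1, 1]))       # site 1 omitted, site 2 duplicated
```

Reading one data set with many weight vectors avoids writing out a full copy of the data for every bootstrap replicate.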
The W (Weights) option. This signals the program that, in addition to the data set, you want to read in a series of weights that tell how many times each character is to be counted.
If the weight for a character is zero (0) then that character is in effect to be omitted when the tree is evaluated. If it is (1) the character is to be counted once. Some programs allow weights greater than 1 as well. These have the effect that the character is counted as if it were present that many times, so that a weight of 4 means that the character is counted 4 times.
The values 0-9 give weights 0 through 9, and the values A-Z give weights 10 through 35. By use of the weights we can give overwhelming weight to some characters, and drop others from the analysis. In the molecular sequence programs only two values of the weights, 0 or 1, are allowed.

The weights are used to analyze subsets of the characters, and also can be used for resampling of the data as in bootstrap and jackknife resampling. For those programs that allow weights to be greater than 1, they can also be used to emphasize information from some characters more strongly than others.
Of course, you must have some rationale for doing this.

The weights are provided as a simple string of digits in an input file whose default name is weights. Blanks in the weightfile are skipped over and ignored, and the weights can continue to a new line.
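The coding of weights as single characters can be decoded as below. This is a small Python sketch under the rules just stated (0-9 for weights 0 through 9, A-Z for 10 through 35, blanks and line breaks ignored); read_weights is a hypothetical name, and rejecting lower-case letters is an assumption of the sketch rather than documented behaviour.

```python
import string

# "0"-"9" stand for weights 0-9 and "A"-"Z" for weights 10-35
WEIGHT_OF = {ch: i for i, ch in enumerate(string.digits + string.ascii_uppercase)}


def read_weights(text):
    """Turn the contents of a weights file into a list of integer weights."""
    weights = []
    for ch in text:
        if ch.isspace():               # blanks and line breaks are skipped
            continue
        if ch not in WEIGHT_OF:
            raise ValueError(f"bad weight character {ch!r}")
        weights.append(WEIGHT_OF[ch])
    return weights


print(read_weights("0011 1011\n10Z A"))
```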
In programs such as Seqboot that can also output a file of weights, the input weights have a default file name of inweights, and the output file name has a default file name of outweights.

Weights can be used to analyze different subsets of characters by weighting the rest as zero. Alternatively, in the discrete characters programs they can be used to force a certain group to appear on the phylogeny (in effect confining consideration to only phylogenies containing that group).
This is done by adding an imaginary character that has 1's for the members of the group, and 0's for all the other species. That imaginary character is then given the highest weight possible: the result will be that any phylogeny that does not contain that group will be penalized by such a heavy amount that it will not (except in the most unusual circumstances) be considered. Of course, the new character brings extra steps to the tree, but the number of these can be calculated in advance and subtracted out of the total when reporting the results.
This use of weights is an important one, and one sadly ignored by many users who could profit from it. In the case of molecular sequences we cannot use weights this way, so that to force a given group to appear we have to add a large extra segment of sites to the molecule, with say A's for that group and C's for every other species.
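The device of forcing a group with an imaginary character can be sketched as follows for discrete (0/1) data. This is an illustration in Python only: force_group is a made-up name, the data are assumed to be held as a mapping from species name to a string of states, and "Z" is used as the heaviest weight on the assumption that the program accepts weights up to 35.

```python
def force_group(data, weights, group, weight_char="Z"):
    """Append an imaginary character marking `group`, with the heaviest weight.

    data maps species name -> string of 0/1 states; weights is the weight
    string for the existing characters. New values are returned.
    """
    unknown = set(group) - set(data)
    if unknown:
        raise ValueError(f"species not in data: {sorted(unknown)}")
    new_data = {
        name: states + ("1" if name in group else "0")
        for name, states in data.items()
    }
    return new_data, weights + weight_char


data = {"Alpha": "01101", "Beta": "01001", "Gamma": "10010", "Delta": "10110"}
new_data, new_weights = force_group(data, "11111", {"Alpha", "Beta"})
print(new_data["Alpha"], new_data["Gamma"], new_weights)
```

Under ordinary parsimony the added character changes once on any tree that contains the group, so in this sketch it would add 35 weighted steps, which is the amount to subtract when reporting the result.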
The option to write out the trees into a tree file. This specifies that you want the program to write out the tree not only on its usual output, but also onto a file in nested-parenthesis notation as described above.
This option is sufficiently useful that it is turned on by default in all programs that allow it. You can optionally turn it off if you wish, by typing the appropriate number from the menu (it varies from program to program). This option is useful for creating tree files that can be directly read into the programs, including the consensus tree and tree distance programs, and the tree plotting programs. The output tree file has a default name of outtree.

The (0) terminal type option. This is the digit 0, not the alphabetic character O. This affects the ability of the programs to clear the screen when they display their menus, and the graphics characters used to display trees in the programs Dnamove, Move, Dolmove, and Retree.

The Algorithm for Constructing Trees

All of the programs except Factor, Dnadist, Gendist, Dnainvar, Seqboot, Contrast, Retree, and the plotting and consensus tree programs act to construct an estimate of a phylogeny.
Move, Dolmove, and Dnamove let you construct it yourself by hand. All of the rest, with a few exceptions such as Neighbor, share a common approach. They are trying to minimize or maximize some quantity over the space of all possible evolutionary trees. Each program contains a part that, given the topology of the tree, evaluates the quantity that is being minimized or maximized. The straightforward approach would be to evaluate all possible tree topologies one after another and pick the one which, according to the criterion being used, is best.
This would not be possible for more than a small number of species, since the number of possible tree topologies is enormous. A review of the literature on the counting of evolutionary trees will be found in one of my papers (Felsenstein, a) and in my book (Felsenstein, chapter 3).
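To give a sense of how enormous the number is, the count of bifurcating topologies can be computed directly. The sketch below is in Python and uses the standard result that each new tip can be attached in 2k-3 places on a rooted tree that will then have k tips, so the counts are products of odd numbers; the function names are made up for the illustration.

```python
def rooted_topologies(n):
    """Number of rooted bifurcating trees with n labeled tips: 1 x 3 x 5 x ... x (2n-3)."""
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 3             # places to attach the k-th tip
    return count


def unrooted_topologies(n):
    """Number of unrooted bifurcating trees with n labeled tips (n >= 3)."""
    return rooted_topologies(n - 1)


for n in (3, 4, 5, 10, 20):
    print(n, rooted_topologies(n), unrooted_topologies(n))
```

For ten species there are already 34,459,425 rooted topologies, and for twenty species more than 10^21, which is why exhaustive evaluation is out of reach.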
Since we cannot search all topologies, these programs are not guaranteed to always find the best tree, although they seem to do quite well in practice. The strategy they employ is as follows: the species are taken in the order in which they appear in the input file. The first two (in some programs the first three) are taken and a tree constructed containing only those.
There is only one possible topology for this tree. Then the next species is taken, and we consider where it might be added to the tree.
If the initial tree is say a rooted tree with two species and we want the resulting three-species tree to be a bifurcating tree, there are only three places where we could add the third species. Each of these is tried, and each time the resulting tree is evaluated according to the criterion. The best one is chosen to be the basis for further operations. Now we consider adding the fourth species, again at each of the five possible places that would result in a bifurcating tree.
Again, the best of these is accepted. This is usually known as the Sequential Addition strategy.

Local rearrangements

The process continues in this manner, with one important exception. After each species is added, and before the next is added, a number of rearrangements of the tree are tried, in an effort to improve it. The algorithms move through the tree, making all possible local rearrangements of the tree. A local rearrangement involves an internal segment of the tree in the following manner: two of the subtrees attached at opposite ends of the segment are exchanged, which can be done in two different ways.
Each time a local rearrangement is successful in finding a better tree, the new arrangement is accepted. The phase of local rearrangements does not end until the program can traverse the entire tree, attempting local rearrangements, without finding any that improve the tree.
This strategy of adding species and making local rearrangements will look at about (n-1) x (2n-3) different topologies, though if rearrangements are frequently successful the number may be larger. I have been describing the strategy when rooted trees are being considered. For unrooted trees there is a precisely similar strategy, though the first tree constructed may be a three-species tree and the rearrangements may not start until after the addition of the fifth species.
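One way to arrive at the figure of (n-1) x (2n-3) is to add up the trees examined at each stage. The Python sketch below does this under assumptions that are mine rather than stated in the text: trees are rooted, the k-th species can be added in 2k-3 places, a rooted tree of k species has k-2 internal segments each allowing two alternative local rearrangements, and no rearrangement succeeds (so each is tried only once).

```python
def topologies_examined(n):
    """Trees looked at when adding n species with one pass of local rearrangements."""
    examined = 1                       # the initial two-species tree
    for k in range(3, n + 1):
        examined += 2 * k - 3          # places to add the k-th species
        examined += 2 * (k - 2)        # local rearrangements of the k-species tree
    return examined


for n in (3, 4, 10, 50):
    print(n, topologies_examined(n), (n - 1) * (2 * n - 3))
```

Under these assumptions the two columns agree for every n, since 1 + n(n-2) + (n-1)(n-2) = (n-1)(2n-3).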
Though we are not guaranteed to have found the best tree topology, we are guaranteed that no nearby topology (i.e. none that can be reached by a single local rearrangement) is better. In this sense we have reached a local optimum of our criterion. Note that the whole process is dependent on the order in which the species are present in the input file. We can try to find a different and better solution by reordering the species in the input file and running the program again (or, more easily, by using the J option).
If none of these attempts finds a better solution, then we have some indication that we may have found the best topology, though we can never be certain of this. Note also that a new topology is never accepted unless it is better than the previous one, so that the rearrangement process can never fall into an endless loop. This is also the way ties in our criterion are resolved, namely by sticking with the tree found first.
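The sequential addition part of the strategy described above (without the rearrangement phase) can be sketched compactly. This is a Python illustration and not the package's code: trees are nested pairs of names, Fitch's method of counting parsimony steps is used as a stand-in criterion, and insertions, site_steps, total_steps and sequential_addition are names invented for the sketch.

```python
def insertions(tree, leaf):
    """Yield each rooted bifurcating tree made by attaching leaf to one branch of tree."""
    yield (tree, leaf)                 # attach on the branch above this node
    if isinstance(tree, tuple):
        left, right = tree
        for sub in insertions(left, leaf):
            yield (sub, right)
        for sub in insertions(right, leaf):
            yield (left, sub)


def site_steps(node, seqs, site):
    """Return (possible states, steps) for one site in the subtree below node."""
    if isinstance(node, str):
        return {seqs[node][site]}, 0
    left_states, left_steps = site_steps(node[0], seqs, site)
    right_states, right_steps = site_steps(node[1], seqs, site)
    shared = left_states & right_states
    if shared:
        return shared, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1


def total_steps(tree, seqs):
    """Parsimony steps summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(site_steps(tree, seqs, site)[1] for site in range(n_sites))


def sequential_addition(order, seqs):
    """Add species in the given order, each time keeping the best placement."""
    tree = (order[0], order[1])
    for name in order[2:]:
        # min() keeps the first of any tied trees, as described in the text
        tree = min(insertions(tree, name), key=lambda t: total_steps(t, seqs))
    return tree


seqs = {"A": "AAAA", "B": "AAAT", "C": "TTTT", "D": "TTTA"}
tree = sequential_addition(["A", "B", "C", "D"], seqs)
print(tree, total_steps(tree, seqs))
```

The generator offers three placements for the third species and five for the fourth, matching the counts given above, and because ties go to the first tree found, a different input order can lead to a different result.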
However, the tree construction programs other than Clique, Contml, Fitch, and Dnaml do keep a record of all trees found that are tied with the best one found. This gives you some immediate idea of which parts of the tree can be altered without affecting the quality of the result.

Global rearrangement is an option (G) in Contml, Fitch, Dnaml and Proml; in the others it automatically applies. When it is present there is an additional stage to the search for the best tree.
Each possible subtree is removed from the tree and added back in all possible places. This process continues until all subtrees can be removed and added again without any improvement in the tree. The purpose of this extra rearrangement is to make it less likely that one or more species gets "stuck" in a suboptimal region of the space of all possible trees.
The use of global optimization results in approximately a tripling (3 x) of the run-time, which is why I have left it as an option in some of the slower programs.
My book (Felsenstein, chapter 4) contains a review of work on these and other rearrangements and search methods. The programs doing global optimization print out a dot "." as each group is removed and re-added, as a sign that the rearrangements are proceeding. A new line of dots is started whenever a new round of global rearrangements is started following an improvement in the tree. On the line before the dots are printed there is printed a bar beginning with "!". The dots will not be printed out at a uniform rate, but the later dots, which represent removal of larger groups from the tree and trying them consequently in fewer places, will print out more quickly.
With some compilers each row of dots may not be printed out until it is complete. It should be noted that Penny, Dolpenny, Dnapenny and Clique use a more sophisticated strategy of "depth-first search" with a "branch and bound" search method that guarantees that all of the best trees will be found. In the case of Penny, Dolpenny and Dnapenny there can be a considerable sacrifice of computer time if the number of species is greater than about ten: it is a matter for you to consider whether it is worth it for you to guarantee finding all the most parsimonious trees, and that depends on how much free computer time you have!
Clique finds all largest cliques, and does so without undue burning of computer time. Although all of these problems that have been investigated fall into the category of "NP-hard" problems that in effect do not have a rapid solution, the cases that cause this trouble for the largest-cliques algorithm in Clique apparently are not biologically realistic and do not occur in actual data.
Multiple jumbles

As just mentioned, for most of these programs the search depends on the order in which the species are entered into the tree. Using the J (Jumble) option you can supply a random number seed which will allow the program to put the species in a random order.
Jumbling can be done multiple times. For example, if you tell the program to do it 10 times, it will go through the tree-building process 10 times, each with a different random order of adding species. It will keep a record of the trees tied for best over the whole process. In other words, it does not just record the best trees from each of the 10 runs, but records the best ones overall.
Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees.
In the terminology of Maddison it can find different "islands" of trees. The present algorithms do not guarantee us to find all trees in a given "island" from a single run, so multiple runs also help explore those "islands" that are found.

Saving multiple tied trees

For the parsimony and compatibility programs, one can have a perfect tie between two or more trees. In these programs these trees are all saved. For the newer parsimony programs such as Dnapars and Pars, global rearrangement is carried out on all of these tied trees.
This can be turned off in the menu. For trees with criteria which are real numbers, such as the distance matrix programs Fitch and Kitsch, and the likelihood programs Dnaml, Dnamlk, Contml, and Restml, it is difficult to get an exact tie between trees. Consequently these programs save only the single best tree even though the others may be only a tiny bit worse.
Strategy for finding the best tree

In practice, it is advisable to use the Jumble option to evaluate many different orderings of the input species, specifying that it be done many times. This is usually not necessary when bootstrapping, though the programs will then default to doing it once to avoid artifacts caused by the order in which species are added to the tree.
People who want a magic "black box" program whose results they do not have to question (or think about) often are upset that these programs give results that are dependent on the order in which the species are entered in the data. To me this property is an advantage, for it permits you to try different searches for better trees, simply by varying the input order of species. If you do not use the multiple Jumble option, but do multiple individual runs instead, you can easily decide which to pay most attention to - the one or ones that are best according to the criterion employed (for example, with parsimony, the one out of the runs that results in the tree with the fewest changes).
In practice, in a single run, it usually seems best to put species that are likely to be sources of confusion in the topology last, as by the time they are added the arrangement of the earlier species will have stabilized into a good configuration, and then the last few species will be fitted into that topology. There will be less chance this way of a poor initial topology that would affect all subsequent parts of the search. However, a variety of arrangements of the input order of species should be tried, as can be done if the J option is used, and no species should be kept in a fixed place in the order of input.
Note that the results of the " Note also that with global search, which is standard in many programs and an option in others, each group (including each individual species) will be removed and re-added in all possible positions, so that a species causing confusion will have more chance of moving to a new location than it would without global rearrangement.

Nixon's search strategy
An innovative search strategy was developed by Kevin Nixon. If one uses a manual rearrangement program such as Dnamove, Move, or Dolmove, and looks at the distribution of characters on the trees, one will see some characters whose distributions appear to recommend alternative groupings.
One would want a program that automatically found such alternative suggestions and used them to rearrange the tree so as to explore trees that had those groups. Nixon had the idea of using resampling methods to do this. Using either bootstrap or jackknife sampling, one can make data sets that emphasize randomly sampled subsets of characters.
We then search for trees that fit those data sets. After finding them, we revert to the initial data set and then search using those trees as starting points. This sampling allows us to explore parts of tree space recommended by particular subsets of characters. This is not exactly Nixon's original strategy, which started the searches for each resampled data set from the best tree found so far.
For each resampled data set we instead start from scratch, doing sequential addition of taxa. Nixon's method has proven to be very effective in searching for most parsimonious trees -- it is currently the state of the art for that. Nixon called his method the "parsimony ratchet", but actually it can be applied straightforwardly to any method of phylogeny inference that has an optimality criterion, including likelihood and least squares distance methods.
Starting with version 3. This makes it possible to implement our variant of Nixon's strategy. You need to do so in multiple steps:
1. Use bootstrap sampling to make a number of resampled versions of the data set. You can also use jackknifing.
2. Take these replicates, and do quick estimates of the phylogeny for each one. This could be done with faster methods such as neighbor-joining or parsimony.
3. Take the resulting trees, together with the original data set. Using the method of phylogeny estimation that you prefer, read the trees in as multiple user-defined trees, selecting the choice in the U menu option that uses these trees as the starting point for rearrangement.
The program will report the best tree or trees found by rearranging all of those input trees.
This accomplishes Nixon's search strategy. It will not necessarily be fast to do this, as the last step may be slow. But the resampling will cause emphasis on different sets of characters in the initial searches, allowing the process to explore regions of tree space not usually examined by conventional rearrangement strategies.
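The resampling step of this strategy can be illustrated in a few lines. This is only a sketch of bootstrap sampling of sites, not a replacement for Seqboot: the function name `bootstrap_sites`, the toy alignment, and the seed are hypothetical, and each replicate would still have to be written out in a format the tree programs can read.

```python
import random


def bootstrap_sites(alignment, rng):
    """Resample alignment columns with replacement, keeping the number of sites."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw as many column indices as there are sites; repeats are expected.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


original = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
rng = random.Random(12345)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_sites(original, rng) for _ in range(5)]
```

Because some columns are drawn several times and others not at all, each replicate emphasizes a different subset of characters, which is what lets the later searches start from different regions of tree space.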
There is some more information on how this may be done in the documentation files for Seqboot and for the individual tree inference programs.

A Warning on Interpreting Results
Probably the most important thing to keep in mind while running any of the parsimony or compatibility programs is not to overinterpret the result.
Some users treat the set of most parsimonious trees as if it were a confidence interval. If a group appears in all of the most parsimonious trees then they treat it as well established. Unfortunately the confidence interval on phylogenies appears to be much larger than the set of all most parsimonious trees Felsenstein, b. Likewise, variation of result among different methods will not be a good indicator of the size of the confidence interval. Many different methods will all give the same result on such a data set: they will estimate the tree as A,B , C,D.
Nevertheless it is clear that the margin by which this tree is favored is not statistically significantly different from So consistency among different methods is a poor guide to statistical significance.

Relative Speed of Different Programs and Machines

Relative speed of the different programs
C compilers differ in efficiency of the code they generate, and some deal with some features of the language better than with others.
Thus a program which is unusually fast on one computer may be unusually slow on another. Nevertheless, as a rough guide to relative execution speeds, I have tested the programs on three data sets, each of which has 10 species and 40 characters.
Farris once called ones like it. The second is the binary recoded form of the fossil horses data set of Camin and Sokal The data sets thus range from a completely compatible one in which there is no homoplasy (parallelism or convergence), through the horses data set, which requires 29 steps where the possible minimum number would be 20, to the random data set, which requires 49 steps.
We can thus see how this increasing messiness of the data affects running times. The three data sets have all had 20 sites of A's added to the end of each sequence, so as to prevent likelihood or distance matrix programs from having infinite branch lengths (the test data sets used for timing previous versions of PHYLIP were the same except that they lacked these 20 extra sites).
The data sets used for the discrete characters programs have 0's and 1's instead of A's and C's. For Contml the A's and C's were made into 0. For the distance programs 10 x 10 distance matrices were computed from the three data sets. It does not make much sense to benchmark Move, Dolmove, or Dnamove, although when there are many characters and many species the response time after each alteration of the tree should be proportional to the product of the number of species and the number of characters.
For Dnaml, Dnamlk, and Dnadist the frequencies of the four bases were set to be equal rather than determined empirically as is the default. For Restml the number of enzymes was set to 1. In most cases, the benchmark was made more accurate by analyzing data sets using the M Multiple data sets option and dividing the resulting time by Times were determined as user times using the Linux time command.
Several patterns will be apparent from this. The algorithms that use the above-described addition strategy (Mix, Dollop, Contml, Fitch, Kitsch, Protpars, Dnapars, Dnacomp, Dnaml, Dnamlk, and Restml) have run times that do not depend strongly on the messiness of the data.
The only exception to this is that if a data set such as the Random data requires extra rounds of global rearrangements it takes longer. The programs differ greatly in run time: the protein likelihood programs Proml and Promlk were very slow, and the other likelihood programs Restml, Dnaml and Contml are slower than the rest of the programs.
The protein sequence parsimony program, which has to do a considerable amount of bookkeeping to keep track of which amino acids can mutate to each other, is also relatively slow. Another class of algorithms includes Penny, Dolpenny, Dnapenny and Clique. This is apparent with Penny, Dolpenny, and Dnapenny, which go from being reasonably fast with clean data to very slow with messy data.
Dolpenny is particularly slow on messy data - this is because this algorithm cannot make use of some of the lower-bound calculations that are possible with Dnapenny and Penny. Clique is very fast on all data sets.
Although in theory it should bog down if the number of cliques in the data is very large, that does not happen with random data, which in fact has few cliques and those small ones. Apparently the "worst-case" data sets that cause exponential run time are much rarer for Clique than for the other branch-and-bound methods.
Neighbor is quite fast compared to Fitch and Kitsch, and should make it possible to run much larger cases, although the results are expected to be a bit rougher than with those programs.
Speed with different numbers of species
How will the speed depend on the number of species and the number of characters? For the sequential-addition algorithms, the run time should be proportional to somewhere between the square and the cube of the number of species, and to the number of characters. Thus a case that has, instead of 10 species and 20 characters, 20 species and 50 characters would take (in the cubic case) 2 x 2 x 2 x 2.5 = 20 times as long.
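The scaling rule just described can be checked numerically. The sketch below assumes only that run time is proportional to a power of the number of species times the number of characters; the function name `relative_time` is hypothetical.

```python
def relative_time(species_old, chars_old, species_new, chars_new, exponent):
    """Ratio of new to old run time if time ~ species**exponent * characters."""
    return (species_new / species_old) ** exponent * (chars_new / chars_old)


# Going from 10 species and 20 characters to 20 species and 50 characters.
cubic = relative_time(10, 20, 20, 50, exponent=3)      # 2**3 * 2.5
quadratic = relative_time(10, 20, 20, 50, exponent=2)  # 2**2 * 2.5
```

The two values bracket the expected slowdown for the sequential-addition programs, since the true exponent should lie somewhere between 2 and 3.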
This implies that cases with more than 20 species will be slow, and cases with more than 40 species very slow. This places a premium on working on small subproblems rather than just dumping a whole large data set into the programs. An exception to these rules will be some of the DNA programs that use an aliasing device to save execution time.
In these programs execution time will not necessarily increase in proportion to the number of sites, as sites that show the same pattern of nucleotides will be detected as identical and the calculations for them will be done only once.
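The idea behind this aliasing can be sketched as follows. This is an illustration of the general technique, not the data structure used inside the programs: identical columns are collapsed into one pattern with a count, and any per-site quantity is then computed once per pattern and weighted. The names `compress_sites` and `weighted_total` and the toy per-site cost are hypothetical.

```python
from collections import Counter


def compress_sites(alignment):
    """Collapse identical alignment columns into pattern -> count."""
    columns = zip(*alignment.values())  # one tuple of states per site
    return Counter(columns)


def weighted_total(alignment, per_pattern_cost):
    """Evaluate a per-site quantity once per distinct pattern, weighted by count."""
    patterns = compress_sites(alignment)
    return sum(count * per_pattern_cost(pattern) for pattern, count in patterns.items())


alignment = {"A": "AAAACA", "B": "AAAACA", "C": "AAATCA"}
patterns = compress_sites(alignment)
# Toy per-site quantity: number of distinct states at the site, minus one.
total = weighted_total(alignment, lambda pattern: len(set(pattern)) - 1)
```

Here six sites reduce to three distinct patterns, so the per-site function is called three times instead of six.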
This is particularly likely to happen with few species and many sites, or with data sets that have small amounts of evolutionary divergence.

For programs Fitch and Kitsch, the distance matrix is square, so that when we double the number of species we also double the number of "characters", so that running times will go up as the fourth power of the number of species rather than the third power. Thus a Fitch run with twice as many species is expected to take sixteen times as long. For programs like Penny and Clique the run times will rise faster than the cube of the number of species (in fact, they can rise faster than any power, since these algorithms are not guaranteed to work in polynomial time).
In practice, Penny will frequently bog down above 11 species, while Clique easily deals with larger numbers. For Neighbor the speed should vary only as the cube of the number of species, so a case twice as large will take only eight times as long. This will make it an attractive alternative to Fitch and Kitsch for large data sets. Suggestion: If you are unsure of how long a program will take, try it first on a few species, then work your way up until you get a feel for the speed and for what size programs you can afford to run.
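One way to follow this suggestion is to time a few small pilot runs and extrapolate. The sketch below fits a power law by least squares on logarithms; the timings are made up for illustration, and the names `fit_power_law` and `predict_time` are hypothetical. It assumes polynomial growth, so it should not be trusted for branch-and-bound programs such as Penny, whose run times can grow faster than any power.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = log(c) + k * log(size); returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


def predict_time(c, k, size):
    """Predicted run time for a given number of species under the fitted law."""
    return c * size ** k


# Hypothetical pilot timings (seconds) for runs on 5, 10 and 20 species.
pilot_sizes = [5, 10, 20]
pilot_times = [0.25, 2.0, 16.0]
c, k = fit_power_law(pilot_sizes, pilot_times)
estimate_40 = predict_time(c, k, 40)
```

With these made-up timings the fitted exponent is close to 3, the behaviour expected of Neighbor, and the extrapolated time for 40 species is about eight times the 20-species time.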
Execution time is not the most important criterion for a program, particularly as computer time gets much cheaper than your time or a programmer's time. With workstations on which background jobs can be run all night, execution speed is not overwhelmingly relevant. Some of us have been conditioned by an earlier era of computing to consider execution speed paramount.
But ease of use, ease of adaptation to your computer system, and ease of modification are much more important in practice, and in these respects I think these programs are adequate.
Only if you are engaged in 's style mainframe computing, or if you have very large amounts of data is minimization of execution time paramount. If you spent six months getting your data, it may not be overwhelmingly important whether your run takes 10 seconds or 10 hours. Nevertheless it would have been nice to have made the programs faster. The present speeds are a compromise between speed and effectiveness: by making them slower and trying more rearrangements in the trees, or by enumerating all possible trees, I could have made the programs more likely to find the best tree.
By trying fewer rearrangements I could have speeded them up, but at the cost of finding worse trees. I could also have speeded them up by writing critical sections in assembly language, but this would have sacrificed ease of distribution to new computer systems.
There are also some options included in these programs that make it harder to adopt some of the economies of bookkeeping that make other programs faster. However to some extent I have simply made the decision not to spend time trying to speed up program bookkeeping when there were new likelihood and statistical methods to be developed.
Relative speed of different machines
It is interesting to compare different machines using Dnapars as the standard task. One can rate a machine on the Dnapars benchmark by summing the times for all three of the data sets.
Here are relative total timings over all three data sets done with various versions of Dnapars for some machines, taking an AMD Athlon 1. Benchmarks from versions 3. They are compared only with each other and are scaled to the rest of the timings using the joint runs on the SX and the Pentium MMX This use of separate standards is necessary not because of different languages but because different versions of the package are being compared.
Thus, the "Time" is the ratio of the Total to that for the Pentium, adjusted by the scalings of machines using 3. The Relative Speed is the reciprocal of the Time.
For the moment these benchmarks are for version 3. The numerical programs benchmark below gives them a fairer test. Note that parallel machines like the Sequent and the SGI PowerChallenge are not really as slow as indicated by the data here, as these runs did nothing to take advantage of their parallelism.
These benchmarks have now extended over 22 years , and in the Dnapars benchmark they extend over a range of over 54,fold in speed! The experience of our laboratory, which seems typical, is that computer power grows by a factor of about 1.
This is roughly consistent with these benchmarks. For a picture of speeds for a more numerically intensive program, here are benchmarks using Dnaml, with an AMD Athlon 1. Numbers are total run times total user time in the case of Unix over all three data sets. You are invited to send me figures for your machine for inclusion in future tables. Use the data sets above and compute the total times for Dnapars and for Dnaml for the three data sets setting the frequencies of the four bases to 0.
If the times are too small to be measured accurately, obtain the times for 10 or data sets (the Multiple data sets option) and divide by 10 or

General Comments on Adapting the Package to Different Computer Systems
In the sections following you will find instructions on how to adapt the programs to different computers and compilers.
The programs should compile without alteration on most versions of C. They use the "malloc" library or "calloc" function to allocate memory so that the upper limits on how many species or how many sites or characters they can run is set by the system memory available to that memory-allocation function. In the document file for each program, I have supplied a small input example, and the output it produces, to help you check whether the programs are running properly.
Compiling the programs can be easy under Linux and Unix, but more difficult if you have a Macintosh or a Windows system. If you have one of the latter, we strongly recommend you download and use the Macintosh and Windows executables that we distribute.
If you do that, you will not need to have any compiler or to do any compiling. I get a certain number of inquiries each year from confused users who are not sure what a compiler is but think they need one. After downloading the executables they contact me and complain that they did not find a compiler included in the package, and would I please e-mail them the compiler.
What they really need to do is use the executables and forget about compiling them. Some users may also need to compile the programs in order to modify them. The instructions below will help with this. This is usually easy to do. Unix and Linux systems generally have a C compiler and have the make utility.
We use GNU's make utility, which might be installed on your system as "make" or as "gmake". However, note that some popular Linux distributions do not include a C compiler in their default configuration. The following instructions assume that you have the C compiler and X libraries. As is mentioned below under Macintoshes, the Mac OS X operating system is a Unix, and if the X windows windowing system is installed, these Unix instructions will work for it.
After you have finished unpacking the Documentation and Source Code archive, you will find that you have created a folder phylip There is also an HTML web page, phylip. The exe folder will be empty, src contains the source code files, including the Makefile. Directory doc contains the documentation files. Enter the src folder. Before you compile, you will want to look at the Makefile and see whether you want to alter the compilation command.
We have the default C compiler flags set with no flags. If you have modified the programs, you might want to use the debugging flags "-g". On the other hand, if you are trying to make a fast executable using the GCC compiler, you may want to use the one which is "An optimized one for gcc". There are careful instructions on this in the Makefile. If these are warnings, rather than errors, they are not too serious.
A typical warning would be like this: dnaml. If you have done a make install the system will then move the executables into the exe folder and also save space by erasing all the relocatable object files that were produced in the process. You should be left with useable executables in the exe folder, and the src folder should be as before.
To run the executables, go into the exe folder and type the program name say dnaml , which you may or may not have to precede by a dot and a slash. The names of the executables will be the same as the names of the C programs, but without the. Thus dnaml. These are provided with most X Windows installations. If you see messages that the compilation could not find "Xlib. Similarly, if you get error messages saying that some files with "Xaw" in the name cannot be found, this means that the Athena Widgets are not installed on your system, or are not installed in the default location.
In either case, you will need to make sure that they are installed properly. In some Linux systems it is not invoked by the command cc but by gcc. You would then need to edit the Makefile to reflect this see below for comments on that process. A typical Unix or Linux installation would put the directory phylip The font files font1 through font6 could also be placed there. It has a table of all of the documentation pages, including this one. If users create a bookmark to that page it can be used to access all of the other documentation pages.
To compile just one program, such as Dnaml, type:
make dnaml
After this compilation, dnaml will be in the src subdirectory. So will some relocatable object code files that were used to create the executable. These have names ending in. If you have problems with the compilation command, you can edit the Makefile. It has careful explanations at its front of how you might want to do so. For example, you might want to change the C compiler name cc to the name of the GNU C compiler, gcc.
This can be done by removing the comment character from the front of one line, and placing it at the front of a nearby line. How to do so should be clear from the material at the beginning of the Makefile. In our code for generating random numbers, we have encountered some problems with the GNU C Compiler (gcc) on 64-bit Itanium processors when compiling at the -O3 optimization level.
Some older C compilers (notably the Berkeley C compiler, which is included free with some Sun systems) do not adhere to the ANSI C standard because they were written before it was set down.
They have trouble with the function prototypes which are in our programs. We have included an ifndef preprocessor command to eliminate the problem, if you use the switch -DOLDC when compiling. Thus with these compilers you need only use this in your C flags in the Makefile and compilers such as Berkeley C will cause no trouble.
Windows systems
We distribute Windows executables, and most likely you can use these and do not need to recompile them. The following instructions will only be necessary if you want to modify the programs and need to recompile them. They are given for several different compilers available on Windows systems.
Another major compiler is the Intel compiler; we do not yet have information on how to use it, but expect that PHYLIP will compile with it.
It has been distributed since , and has over 30, registered users, making it the most widely distributed package of phylogeny programs. It can infer phylogenies by parsimony, compatibility, distance matrix methods, and likelihood.
It can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices. It can handle data that are nucleotide sequences, protein sequences, gene frequencies, restriction sites, restriction fragments, distances, discrete characters, and continuous characters. University of Washington. All rights reserved. Permission is granted to reproduce, perform, and modify these programs and documentation files.
Permission is granted to distribute or provide access to these programs provided that this copyright notice is not removed, the programs are not integrated with or called by any product or service that generates revenue, and that your distribution of these documentation files and programs are free. Any modified versions of these materials that are distributed or accessible shall indicate that they are based on these programs.
Institutions of higher education are granted permission to distribute this material to their students and staff for a fee to recover distribution costs.
Permission requests for any other distribution of these programs should be directed to license at u.

These include the main documentation file (this one), which you should read fairly completely.
In addition there are files for groups of programs, including ones for the molecular sequence programs, the distance matrix programs, the gene frequency and continuous characters programs, the discrete characters programs, and the tree drawing programs.
Finally, each program has its own documentation file. References for the documentation files are all gathered together in this main documentation file. A good strategy is to:
1. Read this main documentation file.
2. Tentatively decide which programs are of interest to you.
3. Read the documentation files for the groups of programs that contain those.
4. Read the documentation files for those individual programs.

What The Programs Do
Here is a short description of each of the programs.
For more detailed discussion you should definitely read the documentation file for the individual program and the documentation file for the group of programs it is in.
In this list the name of each program is a link which will take you to the documentation file for that program.

Clique
Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states.
The largest clique (or all cliques within a given size range of the largest one) are found by a very fast branch and bound search method. The method does not allow for missing data.
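To make the notion of a clique of mutually compatible characters concrete, here is a small sketch. It is not Clique's branch and bound search: it uses exhaustive enumeration, which is only practical for a handful of characters, and it assumes two-state characters with unknown ancestral states, for which two characters are compatible unless all four combinations of states occur. The names `compatible` and `largest_cliques` and the data are hypothetical.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Two binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(char_a, char_b))) < 4


def largest_cliques(characters):
    """Return all largest sets of mutually compatible characters (index tuples)."""
    n = len(characters)
    for size in range(n, 0, -1):
        found = [
            subset
            for subset in combinations(range(n), size)
            if all(compatible(characters[i], characters[j])
                   for i, j in combinations(subset, 2))
        ]
        if found:
            return found
    return []


# Each string is one character scored across five species (hypothetical data).
characters = ["11000", "11100", "00011", "10101"]
cliques = largest_cliques(characters)
```

In this toy data set the fourth character conflicts with each of the others, so the largest clique consists of the first three.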
When data are missing, the T (Threshold) option of Pars or Mix may be a useful alternative. Compatibility methods are particularly useful when some characters are of poor quality and the rest of good quality, but when it is not known in advance which ones are which.

Consense
Computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree. Is not able to compute the Adams consensus tree. Trees are input in a tree file in standard nested-parenthesis notation, which is produced by many of the tree estimation programs in the package.
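The majority-rule idea used by Consense can be sketched briefly. This is a simplified illustration rather than the program's algorithm: it treats trees as rooted, represents them as nested Python tuples instead of parsing nested-parenthesis files, and keeps every group that appears in more than half of the trees. The names `clades` and `majority_rule_groups` and the three trees are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return (tips, groups): the tip set of a nested-tuple tree and its groups."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips = frozenset()
    groups = set()
    for child in tree:
        child_tips, child_groups = clades(child)
        tips |= child_tips
        groups |= child_groups
    groups.add(tips)
    return tips, groups


def majority_rule_groups(trees):
    """Groups present in more than half of the trees, with their frequencies."""
    counts = Counter()
    for tree in trees:
        all_tips, groups = clades(tree)
        groups.discard(all_tips)  # the set of all tips carries no information
        counts.update(groups)
    return {g: c / len(trees) for g, c in counts.items() if c > len(trees) / 2}


trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), "D"), ("C", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
consensus = majority_rule_groups(trees)
```

Groups with a frequency of 1.0 would make up the strict consensus; in this example no group appears in all three trees.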
Consense can be used as the final step in doing bootstrap analyses for many of the methods in the package.

Contml
Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative method of analyzing this data is to compute Nei's genetic distance and use one of the distance matrix programs. This program can also do maximum likelihood analysis of continuous characters that evolve by a Brownian Motion model, but it assumes that the characters evolve at equal rates and in an uncorrelated fashion, so that it does not take into account the usual correlations of characters.
Contrast
Reads a tree from a tree file, and a data set with continuous characters data, and produces the independent contrasts for those characters, for use in any multivariate statistics package. Will also produce covariances, regressions and correlations between characters for those contrasts. Can also correct for within-species sampling variation when individual phenotypes are available within a population.
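A single step of the independent contrasts calculation can be written down compactly. The sketch below handles one pair of sister tips under a Brownian motion model, with branch lengths standing in for variances; it is an illustration of the standard formulas, not the program's code, and the function name `contrast_pair` and the numbers are hypothetical.

```python
import math


def contrast_pair(x1, v1, x2, v2):
    """One step of independent contrasts for two sister tips.

    x1, x2 -- character values at the two tips
    v1, v2 -- lengths of the branches leading to them
    Returns (standardized contrast, ancestral value, extra branch length
    to add to the branch below the ancestor).
    """
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Ancestral value: average weighted by the inverse branch lengths.
    ancestor = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    extra_length = v1 * v2 / (v1 + v2)
    return contrast, ancestor, extra_length


contrast, ancestor, extra = contrast_pair(x1=4.0, v1=1.0, x2=2.0, v2=1.0)
```

Repeating this step down the tree, with the ancestor treated as a new tip whose branch is lengthened by the extra amount, yields one contrast per interior node.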
Dnacomp
Estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites which could have all states (nucleotides) uniquely evolved on the same tree. Compatibility is particularly appropriate when sites vary greatly in their rates of evolution, but we do not know in advance which are the less reliable ones.
Dnadist
Computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs. The distances are the Jukes-Cantor formula, one based on Kimura's 2-parameter method, the F84 model used in Dnaml, and the LogDet distance.
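Of these, the Jukes-Cantor formula is simple enough to show directly. The sketch below computes the proportion of differing sites and converts it to the Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3). It is only an illustration: it ignores ambiguity codes, gaps, and site weights, all of which a real analysis would need to handle, and the function names and sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor distance: expected substitutions per site given proportion p."""
    if p >= 0.75:
        raise ValueError("distance is undefined when p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTACGTAT")  # one difference in ten sites
d = jukes_cantor(p)
```

The corrected distance is slightly larger than the observed proportion because it allows for multiple changes at the same site.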
The distances can also be corrected for gamma-distributed and gamma-plus-invariant-sites-distributed rates of change in different sites. Rates of evolution can vary among sites in a prespecified way, and also according to a Hidden Markov model.
The program can also make a table of

Dnainvar
For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns. Lake's invariants are the method which he calls "evolutionary parsimony".

Dnaml
Estimates phylogenies from nucleotide sequences by maximum likelihood.
The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different prespecified rates of change in different categories of sites, and also use of a Hidden Markov model of rates, with the program inferring which sites have which rates.
This also allows gamma-distribution and gamma-plus-invariant sites distributions of rates across sites.

Dnamlk
Same as Dnaml but assumes a molecular clock. The use of the two programs together permits a likelihood ratio test of the molecular clock hypothesis to be made.

Dnamove
Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and the display of reconstructed ancestral bases.
This can be used to find parsimony or compatibility estimates by hand.

Dnapars
Estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states.
Gaps are treated as a fifth nucleotide state. It can also do transversion parsimony.

Dnapenny
Finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search. This may not be practical (depending on the data) for more than species or so.

Dollop
Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1).
Also reconstructs ancestral states and allows weighting of characters. Dollo parsimony is particularly appropriate for restriction sites data; with ancestor states specified as unknown it may be appropriate for restriction fragments data.
Dolmove
Interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree.

Dolpenny
Finds all most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria using the branch-and-bound method of exact search.
May be impractical (depending on the data) for more than species.

Drawgram
Plots rooted phylogenies, cladograms, circular trees and phenograms in a wide variety of user-controllable formats. The program is interactive. It has an interface in the Java language which gives it a closely similar menu on all three major operating systems.
Final output can be to a file formatted for one of the drawing programs, for a ray-tracing or VRML browser, or one that can be sent to a laser printer (such as Postscript or PCL-compatible printers), on graphics screens or terminals, on pen plotters or on dot matrix printers capable of graphics. Many of these formats are historic, so we no longer have hardware to test them. If you find a problem please report it.
Drawtree
Similar to Drawgram but plots unrooted phylogenies. It also has a Java interface for previews.

Factor
Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham. This program was formerly used to accommodate multistate characters in Mix, but this is less necessary now that Pars is available.
Fitch
Estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. Uses the Fitch-Margoliash criterion and some related least squares criteria, or the Minimum Evolution distance matrix method. Does not assume an evolutionary clock. This program will be useful with distances computed from molecular sequences, restriction sites or fragments distances, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.
Gendist
Computes one of three different genetic distance formulas from gene frequency data. The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The first is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift.
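As an illustration of the first of these, Nei's standard genetic distance can be computed from allele frequencies as D = -ln(J_xy / sqrt(J_x J_y)), where J_xy sums the products of matching allele frequencies over all loci and J_x, J_y sum the squared frequencies in each population. The sketch below is not Gendist's code; the function name `nei_distance` and the frequencies are hypothetical, and it assumes the same loci and allele order in both populations.

```python
import math


def nei_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance from per-locus allele frequency lists."""
    j_xy = sum(px * py for lx, ly in zip(freqs_x, freqs_y) for px, py in zip(lx, ly))
    j_x = sum(px * px for lx in freqs_x for px in lx)
    j_y = sum(py * py for ly in freqs_y for py in ly)
    return -math.log(j_xy / math.sqrt(j_x * j_y))


# Hypothetical frequencies: two loci, alleles in the same order in both populations.
pop1 = [[0.5, 0.5], [0.9, 0.1]]
pop2 = [[0.5, 0.5], [0.1, 0.9]]
d_same = nei_distance(pop1, pop1)
d_diff = nei_distance(pop1, pop2)
```

A population compared with itself gives a distance of essentially zero, and the distance grows as allele frequencies diverge.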
The distances are written to a file in a format appropriate for input to the distance matrix programs.

Kitsch
Estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed.
The Fitch-Margoliash criterion and other least squares criteria, or the Minimum Evolution criterion are possible. This program will be useful with distances computed from molecular sequences, restriction sites or fragments distances, with distances from DNA hybridization measurements, and with genetic distances computed from gene frequencies.
Mix
Estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Allows use of the Wagner parsimony method, the Camin-Sokal parsimony method, or arbitrary mixtures of these.
Also reconstructs ancestral states and allows weighting of characters (does not infer branch lengths).

Move
Interactive construction of phylogenies from discrete character data with two states (0 and 1).

Neighbor
Neighbor Joining is a distance matrix method producing an unrooted tree without the assumption of a clock. UPGMA does assume a clock. The branch lengths are not optimized by the least squares criterion but the methods are very fast and thus can handle much larger data sets.
Pars Multistate discrete-characters parsimony method. Up to 8 states as well as "?