Character simulations and randomizations,
and their use in hypothesis testing
Mesquite can simulate evolution of
organismal characteristics, including DNA sequences. With these
tools, you can build statistical tests of hypotheses.
This section provides a brief introduction to some of the analyses
you can do with Mesquite, including:
Example files for this chapter are in the "Simulations"
folder within the "examples" folder within Mesquite_Folder.
Simulating DNA sequence evolution
on a tree
First, let's try evolving DNA sequences up the branches of a
model tree, thereby creating an entire simulated matrix. Open
the example file "modelTree.nex" and
you will be presented with a tree. For the moment it doesn't matter
where this tree came from; the important thing to note is that
it has branch lengths inferred from a matrix of 18S rDNA; this
matrix is also included in the file.
In order to simulate DNA sequences, you need:
- A tree, with branch lengths specified. A tree with branch
lengths is included in this example file.
- A specification of how the DNA sequences evolve. This model
could have many elements. One common form of model specifies
the relative frequencies of A, C, G, and T; the relative rates
of change of the different characters; and the relative rates
of change between A and C, A and G, and so on.
Creating a model of character evolution
For sake of simplicity, let's presume that A, C, G, and T occur
at equal frequencies, and that all pairs of nucleotides have equal
rates of change between them. To make it interesting, though,
let's presume that some characters evolve more quickly than others,
with the distribution of rates of change following a discrete
gamma distribution with shape parameter 1.5. If one presumes (as
we will) that there were four categories of rates, then this model
presumes that one quarter of the characters have rate 0.225, one
quarter 0.589, one quarter 1.05, and the rest 2.136, as shown
in the following figure:

To create this model of character rate variation in Mesquite,
choose (Tree)
Characters>New Character Submodel>Gamma Rate Variation Model...
You will be presented with a dialog box in which you can choose
the name of the model you will create; you might call it "my
gamma model". You will then be presented with a dialog box
allowing you to specify the nature of the model; set the shape
parameter to 1.5:

Now we need to create a composite model of character evolution
that specifies both this gamma rate variation model, the state
frequency model, and the state-to-state rate model. To do this,
choose (Tree)
Characters>New Character Model>Composite DNA Simulation
Model..., name the model (perhaps call it "my
full model"), and you will be presented with a dialog box
in which you can choose the various submodels:

The only element of this model we will change is
the character rates model; choose "my gamma model" from
the drop-down menu to select the gamma distribution model you
created as the model of site-to-site rate variation. The model
"my full model" is now ready to use.
Note that the dialog box in which you edited "my
full model" has button with a ? on it (
).
When a button like this is present in a Mesquite window, touching
on it will display a small note with information about how to
use the window's feature.
Simulating evolution
Now choose (Tree)
Characters>Make New Matrix from>Simulated Matrices on Current
Tree. This will ask Mesquite to simulate evolution
of characters up the current tree in the tree window to produce
a new matrix. You will be asked what sort of character simulator
to use. As we wish to evolve a DNA sequence up the tree's branches,
choose "Evolve DNA characters" from the list. (There
might be other options presented, such as Evolve Continuous Characters,
but we don't wish to try that now.) You will then need to chose
the model of DNA sequence evolution; choose the composite one
you created earlier called "full model". When it queries
for the number of character, choose a large number, such as 2000.
When it asks for the name of the matrix to be created, call it
"simulated" (although you could choose another name,
if you wish). Mesquite will then simulate evolution, and a matrix
will be produced, such as this one:

Examining the results of the simulation
Let's see whether the rate variation between characters
suggested by this simulated matrix matches a gamma distribution
with shape parameter 1.5, as used in the model. First, save the
matrix to a file so that you can read it into PAUP*. Choose (Character
Matrix) Characters>Save Copy of Matrix>simulated
to save a copy of your newly created matrix; perhaps name the
file "Simulated Matrix".
In PAUP* 4, execute the file Simulated Matrix, and
then execute the following commands:
nj;
lset nst=1 basefreq=equal rates=gamma shape=estimate;
lscore 1;
(If you don't know how to give PAUP* commands, please
read your PAUP* documentation. There are equivalent ways to give
these commands in the MacOS versions using dialog boxes, but it
is much easier for us to describe what to do with these text commands.)
The first command ("nj;") will find a
neighbor-joining tree. While this might not be the best way to
get a tree, but it will be fast, and give a tree that is good
enough for our purposes. The second command ("lset...")
sets the model used for likelihood calculations to be the model
used by Mesquite in the simulations, except that the shape parameter
used is not specified, but is instead estimated from the data.
The third command ("lscore 1;") tells PAUP* to calculate
the likelihood of the tree, in the course of which it will also
estimate the shape parameter of the gamma distribution using likelhood.
You might end up with PAUP* reporting something like this:
Tree 1
-------------------
-ln L 19742.05025
Shape 1.583552
The shape parameter should be around 1.5.
Building more complex models
More complex composite models of DNA evolution can
be built by choosing among the following:
Models of root state frequencies:
- Equal Frequencies: equal frequencies of states
- Empirical Frequencies: frequencies of states the same as that
found in an existing character matrix
- User-specified Nucleotide Frequencies: frequencies of nucleotides
specified by the user. You can create a model of this sort by
choosing (Character
Matrix) Characters>New Character Submodel>User-specified
Nucleotide Frequencies... and then entering the appropriate
values in the dialog box.
Models of equilibrium state frequencies on other
branches:
- Equal Frequencies: equal frequences of states
- Empirical Frequencies: frequencies of states the same as that
found in an existing character matrix
- User-specified Nucleotide Frequencies: frequencies of nucleotides
specified by the user. You can create a model of this sort by
choosing (Character
Matrix) Characters>New Character Submodel>User-specified
Nucleotide Frequencies... and then entering the appropriate
values in the dialog box.
The above two submodels corresponds in PAUP* to
the specifying the base frequencies in the likelihood settings;
note that PAUP* does not allow separate specification of the model
of frequences and the root and at other branches.
Models of rate variation among characters:
- Equal Rates: all characters evolve at the same rate.
- Codon Position Rates Model: a model specifying the relative
rates of the different codon positions. If you wish to have
the codon positions match those in an existing matrix, then
codon positions need to be specified for a DNA matrix. To do
this, choose (Character
Matrix) Characters>List of Characters to see the
current codon positions. Select the characters in this list
window, and then touch on the title of the column "Codon
Position". A drop-down menu will appear in which you can
choose to set the codon positions to a sequence like 123123123...
as appropriate.
- Gamma Rates Model: a model specifying that rates of characters
evolve according to a gamma distribution.
- Gamma Invar Rates Model: a model specifying that a proportion
of the characters are invariant, and the remainder follow a
gamma distribution.
- Proportion Invariant Model: a model specifying that a proportion
of the characters are invariant, and the remainder evolve at
one rate.
The first model ("Equal Rates") is built
in; you can create a model of the other kinds by choosing the
appropriate items from the (Character
Matrix) Characters>New Character Submodel menu and
then entering the values in the dialog box.
Model of rate matrices:
- Single Rate: all changes between nucleotides occur at the
same rate.
- Ti/Tv Rate Matrix Model: the two-parameter rate matrix model
in which transversions occur at a different rate than transversions.
- GTR Rate Matrix Model: the General Time Reversible, six-parameter
rate matrix model in which you can specify rates of each type
of nucleotide change.
The first model ("Single Rate") is built
in; you can create a model of the other kinds by choosing the
appropriate items from the (Character
Matrix) Characters>New Character Submodel menu and
then entering the values in the dialog box.
Simulating, exporting, and
analyzing multiple matrices
Mesquite has a facility to create multiple simulated
matrices, export them to files, and create one or more files that
provide instructions to programs to analyze the files produced.
These instruction files are called "batch files".
One could, for example, examine the quality of Mesquite's
simulation algorithms by simulating 100 matrices using a gamma
model of character rate variation. In addition to creating 100
files with matrices, Mesquite would create a batch file containing
instructions for PAUP* to estimate the gamma shape parameter for
each of the matrices and save the resulting values to a score
file. Mesquite would also create an file that would provide instructions
to Mesquite to allow it to read PAUP*'s score file and display
the resulting distribution of estimated gamma shape parameters
in a chart. As a concrete example, simulation of 1000 matrices
of 2000 characters each under a model with a gamma shape parameter
of 2.0, estimating of the shape parameter by PAUP* for each of
these matrices, and processing of the results by Mesquite yielded
the following histogram:

In this particular analysis, most matrices were
estimated to have a gamma shape parameter close to that used in
the model that generated the data; the average estimated gamma
shape parameter over the 1000 replicates was 2.017.
Conducting multiple simulations
To conduct an analysis similar to this one, open
up the example file "gammaTest.nex".
A complete model of DNA sequence evolution has already been added
to that file, under the name "gamma2model". It specifies,
among other components, character rate variation following a gamma
distribution with shape parameter 2.0. First, you will need to
ask Mesquite to evolve several matrices up the branches of the
phylogeny. Choose (Tree
Window) Analysis > Matrices & Batch Files > Export Matrices
& Batch Files.... You will be presented with a
choice of the source of matrices; choose "Simulated Matrices
on Current Tree", and press OK; for the character simulator,
choose "Evolve DNA characters". When asked to choose
the model, choose "gamma2model".
You will now be presented with the main dialog box
that you will need to master to use Mesquite's simulation tools.
It looks approximately like this:

The top part of the dialog box allows one to choose
the base part of the name of the simulated matrices that will
be produced, as well as to specify how many matrices will be produced
(the "number of replicates"). Let's call have the file
names begin with "gammaTest", so enter that for the
base name. In the interests of time, you can leave the number
of replicates be only 10 (a more thorough analysis would involve
many more replicates).
In addition to creating the matrices, Mesquite will
also produce one or two files accompanying those matrices that
specify commands to be used by various software (e.g. PAUP, NONA,
even Mesquite) to analyze those matrices. The manner in which
these accessory files or "batch files" are written is
controlled by the lower portion of the dialog box. The content
of the batch files is specified by a "batch file template".
There are some templates supplied by Mesquite; you can also build
your own. The template we need to use for this example is not
built into Mesquite; you will need to load it in from a file.
To do this, go into the Templates Manager, by pressing the Edit
Templates button. In the dialog that is presented, you can see
the build-in templates:

To load in the template we need to use, press the
Load button, and open the "GammaTest.template"
file that is in the "templates" folder within the Simulations
examples folder. The template Gamma Model Test will now be added
to the templates list. (You can see the nature of the template
by selecting it in this list and pressing "Edit", but
you needn't do this now.)
Once the template is loaded, press Done to return
to the Export Matrices & Batch Files dialog box. Make sure
"Gamma Model Test" is selected, and press "OK".
You will now be asked to choose a folder to store the created
files. As multiple files will be created, you may wish to create
a new, empty folder to house the files. Once the location is chosen,
you will be asked (after perhaps dismissing a notice or two) for
the number of characters to be simulated within each matrix. Choose
a fairly high number, such as 2000. Mesquite will now proceed
to create through simulation all 10 files, plus two batch files.
One of these batch files will be a file containing
PAUP commands that will instruct PAUP* to read in each matrix
file, quickly build a preliminary tree (using neighbor-joining,
which should be good enough for our purposes), and estimate the
gamma shape parameter using likelihood. The resulting value will
be saved to a PAUP "scorefile". Open up PAUP* 4 (make
sure you have the latest version of PAUP*), and ask it to execute
the file "gammaCommands.nex". After
PAUP has finished its analysis, go back to Mesquite, and choose
(Tree Window)
Analysis > Matrices & Batch Files > Show Results via
Instruction File.... You will be asked to first choose
an "instruction file", which is a text file telling
Mesquite how to interpret another file. Choose the file "MesquiteInstructions",
which is the second batch file created by the Gamma Model Test
template. After Mesquite processes the instruction file, it will
ask you to find the results file, which in this case is the score
file created by PAUP*. It should be called "gammaTestScore.scr"
(if the base name of the matrices was "gammaTest");
open it when asked for the results file. You will then be presented
with a histogram of the estimated gamma shape parameter values,
that will look approximately like this:

Of course, your exact values should differ from
these, but should cluster around 2.0. If you repeated this, but
with many more replicates, then you should see a bell-shaped curve,
as shown above.
Testing simulations of more complex models
A test of accuracy of Mesquite's more complex models
can also be done. For example, there are seven parameters in model
that includes a General Time Reversible model of DNA evolution
and assuming a proportion of the characters are invariant with
the remainder following a gamma distribution. In 1000 replicates
of 2000 characters evolved up the branches of a 49-taxon tree,
the average value of these parameters as estimated by PAUP* is
close (within 1.3 %) of the value used in model used in Mesquite,
as shown by the results of one particular analysis:
| Parameter |
Value in model |
Estimated value |
Difference |
| Rate(A<->C) |
1.8700052 |
1.87771 |
0.41% |
| Rate(A<->G) |
4.2537253 |
4.26319 |
0.22% |
| Rate(A<->T) |
2.5286454 |
2.53006 |
0.06% |
| Rate(C<->G) |
0.626305 |
0.63152 |
0.83% |
| Rate(C<->T) |
8.735118 |
8.77468 |
0.45% |
| proportion invariant |
0.506903 |
0.50903 |
0.42% |
gamma
shape |
0.438522 |
0.44422 |
1.30% |
Parametric Bootstrapping:
testing an hypothesis about phylogenetic structure
You can use Mesquite's Genesis package to build
your own statistical tests of various phylogenetic hypotheses.
To illustrate this, we will use an example from beetles. There
is an enigmatic group of terretrial beetles called the Trachypachidae
(left, below). These had been considered to be related to other
terrestrial beetles, but more recent analyses of morphological
data suggested that they were instead closely related to some
water beetles, the Dytiscoidea (right, below).


A terrestrial trachypachid (left) and an aquatic dytiscoid (right).
Shull et al. (2001) report sequence data of the
18S rRNA gene suggesting that trachypachids are not related to
dytiscoids, but are instead related to some other terrestrial
beetles. For example, in one analysis, the most parsimonious trees
showing trachypachids with dytiscoids are 9 steps longer than
unconstrained most parsimonious trees. We might ask if this difference
in length is expected under the hypothesis that trachypachids
and dytiscoids are related. (We could do a similar test asking
about differences in some measure other than treelength (e.g.,
likelihood), but the calculations would take longer than this
tutorial would allow.)
To test this hypothesis, we first need to flesh
out its details. It needs to be detailed enough to allow us to
predict the observations we would expect to see if this hypothesis
were true. First of all, we will need a detailed phylogenetic
hypothesis. We could find the tree of highest likelihood for 18S
rDNA under the constraint that trachypachids are with dytiscoids;
for one matrix, this tree is as shown below, with trachypachids
marked in green and dytiscoids in blue.

Branch lengths of this tree and parameter values
for a GTR + Gamma + Proportion Invariant (GTR+G+I) model of evolution
are estimated by maximum likelihood using the 18S rDNA data. We
thus have built a model that contains our best guess of the nature
of evolution presuming that trachypachids and dytiscoids form
a clade; the details of the model were established using the 18S
rDNA data.
We can now ask: if evolution occurred under a GTR+G+I
model with the parameter values as inferred, up the tree shown
above (with trachypachids and dytiscoids forming a clade), then
what would we expect the difference in treelength to be between
the most parsimonious trees constrained to have trachypachids
with dytiscoids and the most parsimonious unconstrained trees?
Would that difference in general be similar to the observed value,
9? Or is 9 an unexpected value? Specifically, is 9 a value that
we would expect to observe less than 0.05 of the time? If so,
then we could reject the hypothesis that trachypachids are related
to dytiscoids.
To conduct this test, open the example file "TestingTreeStructure.nex".
This file contains a matrix of 18S rDNA, and the tree of highest
likelihood found in which trachypachids are with dytiscids, with
branch lengths inferred from the data. It also contains a GTR+G+I
model with parameters inferred by PAUP* using maximum likelihood
and 18S rDNA. We will ask Mesquite to simulate 100 matrices, and
then (using the batch file Mesquite produces) ask PAUP* to find
the most parsimonious trees with trachypachids constrained to
be with dytiscoids and the most parsimonious unconstrained trees,
and write their treelengths to a scorefile. Mesquite will then
read in the results and calculate the distribution of treelength
differences, allowing us to determine if the observed value of
9 is unusual.
The simulations can be done by choosing (Tree
Window) Analysis > Matrices & Batch Files > Export Matrices
& Batch Files.... After selecting "Simulated
Matrices on Current Tree" and "Evolve DNA Characters,
choose the GTR+G+I model. You will then be presented with the
Export Matrices & Batch Files dialog box. Let's give a base
name for the matrices of "TDTest", and do 100 replicates.
We will use the template "PAUPConstraintTestParsimony":

If you wish to look at the contents of this template,
touch on "Edit Templates", and in the Template Manager,
select "PAUPConstraintTestParsimony", and press "View".
(As this template is built-in, you can't edit it, only view it.)
You will see some elements of the batch files that will be produced.

Don't worry about the complexity of this template;
details of how it works will be presented elsewhere. Just press
"OK" to get back to the Template Manager, and then press
"Done".
You are almost ready to have Mesquite simulate the
matrices. However, the template we are using requires that we
specify a tree to use as a constraint tree. In this case, the
constraint tree is one with trachypachids with dytiscoids, but
with no other structure. A tree like this is stored in the file
under the name "TD constraint". To choose it, touch
the "Choose Tree" button, and select "TD constraint".
All the options have now been chosen. Press the
OK button in the Export Matrices and Batch Files dialog box to
start the simulation. You will be asked for a location to save
the files, and be given some messages. When you are asked for
the number of characters, Mesquite recognizes that you are calculating
some elements of the model using an existing matrix (in particular,
the frequencies of A, C, G, and T), and for this reason it gives
as the default number of characters the number in the original
matrix. That's the number we want to use in this case, as we want
the simulation to be as realistic as possible.
Once the simulations are all done, then go into
PAUP*, and execute the file "PaupCommands.nex". When PAUP* completes
its analyses, go back to Mesquite, and choose (Tree
Window) Analysis > Matrices & Batch Files > Show Results
via Instruction File.... Choose the file "MesquiteInstructions",
in the same folder as the simulated matrices, and then choose
the results file (which should be called "TDTestScore.scr",).
Mesquite should then show you a histogram of the treelength differences.
An analysis of this exact sort but with 500 replicates yielded
the following histogram:

Of the 500 replicates, 439 had a value 0, 58 had a value 1, and
3 had a value 2. You can determine these numbers by touching on
each bar with the arrow:

You can also see these values by going to the Text view of the chart (by touching
the Text tab at the upper part of the chart window):

To see what values are in a specified percentile of the left
or right tail of the distribution, you can choose (Histogram)
Histogram > Analysis > Show Percentile.... In the dialog box presented,
you can choose the value of the percentile, the color of the bar
to be shown, and whether the left or right tails (or both) are
calculated. By default, the percentile value is 0.05; on this
example, it would be displayed as:

All values to the right of this red bar are thus
in the extreme of the distribution, and our simulation would suggest
that any value greater than or equal to 2 would occur with a frequency
of less than 0.05. Our observation of 9 is thus an unlikely outcome
if our hypothesis were true. Thus, we can reject our hypothesis
at p < 0.05, and conclude that trachypachids and dytiscoids
do not form a clade.
The histogram bars are sometimes drawn in such a way that the
vertical percentile bar is somewhat confusing, especially if the
values are whole numbers (we hope to improve this in future versions).
For this reason, you might want to know the exact values of the
percentile boundaries. This information can can be gathered by
looking at the text view for the histogram. Toward the bottom
of the text view in the above example is the following:

The percentile bar shown will be that that is closest
to that requested, but not over it. Thus, if you requested a percentile
of 0.05, and if a percentile bar might be placed at 0.043 (left,
below), but moving it one increment higher would include enough
values to increase the percentile to 0.051 (right, below), then
the percentile bar shown will be 0.043.
Other analyses that Mesquite can do
The examples given about used the
Genesis package in Mesquite to simulate the evolution of characters
under a specified model of DNA evolution, and up the branches of
the current tree on the screen. Many more sorts of analyses can
be done, using different sources of characters other than simulated
DNA data, different sources of trees other than the current tree
on the screen, and different calculations by other programs.
Some of the different sources of character matrices
include:
- Simulators that evolve data other than DNA data, including
continuously valued data.
- Random modifications of existing matrices, including by non-parametric
resampling (as used in classic bootstrap methods), random reshuffling
of data, and jackknifing.
Some of the different sources of trees include:
- Simulations using markovian models of speciation
- Coalescence simulations of gene trees
- Modifications of an existing tree by random modification of
branch lengths
- Modifications of an existing tree by randomly pruning a fraction
of the taxa or by randomly adding taxa
- Randomly sampled trees among all possible trees
- Trees generated by randomly reshuffling taxa at their tips
References
Shull, V., A.P. Vogler, M.D. Baker, D.R. Maddison, and P.M. Hammond.
2001. Sequence alignment of 18S ribosomal RNA and the basal relationships
of adephagan beetles: Evidence for monophyly of aquatic families
and the placement of Trachypachidae. Systematic Biology, 50:945-969.