Molecular data
Molecular data (DNA or protein sequences) can be edited, manipulated,
simulated and analyzed in various ways in Mesquite. Most of the
features discussed elsewhere concerning editing and analysis of
general categorical data also apply to molecular data; here we
focus on features specifically designed for sequence data.
Editing molecular data
Molecular data can be imported from files of NBRF format, PHYLIP
format, and simple table format. It can also be exported to these
formats.
The Character Matrix Editor
can be used to edit a molecular sequence matrix. Standard ambiguity
codes are allowed.
The following can be applied to all or the selected portions
of a molecular sequence matrix in the Character Matrix Editor.
These are available under the Alter/Transform submenu of the Matrix
menu:
- Nucleotide complement (DNA matrix only) —
enters the complementary sequence into the selected cells
- Reverse sequence — reverses the order
of contiguously selected blocks of sequence
- Collapse Gaps — collapses gaps to yield
unaligned sequences
- Collapse Gaps-Only Edges — deletes
characters at edges of matrix that are gaps-only.
Other options may appear; see the page on characters
for standard choices in this submenu. You can also apply the other
editing tools described for character
matrices.
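The effect of these transformations on a sequence can be sketched in a few lines of Python. This is an illustration only, not Mesquite's implementation: the function names are hypothetical, sequences are treated as plain strings with "-" as the gap symbol, and Mesquite itself works on matrix cells and may handle padding and selection differently.

```python
# Sketch only: plain-Python analogues of the Alter/Transform operations.
# All function names are hypothetical; none of this is Mesquite's API.

# IUPAC nucleotide complements, including the standard ambiguity codes.
_COMPLEMENT = str.maketrans(
    "ACGTRYKMSWBDHVNacgtrykmswbdhvn",
    "TGCAYRMKSWVHDBNtgcayrmkswvhdbn",
)


def complement(seq: str) -> str:
    """Complement each base; gaps and other symbols pass through unchanged."""
    return seq.translate(_COMPLEMENT)


def reverse(seq: str) -> str:
    """Reverse the order of a contiguous block of sequence."""
    return seq[::-1]


def collapse_gaps(seq: str, gap: str = "-") -> str:
    """Remove gaps to yield the unaligned sequence."""
    return seq.replace(gap, "")


def collapse_gaps_only_edges(matrix: list[str], gap: str = "-") -> list[str]:
    """Trim leading and trailing columns that are gaps in every row.

    Assumes all rows have the same length.
    """
    if not matrix:
        return []
    n = len(matrix[0])

    def all_gaps(col: int) -> bool:
        return all(row[col] == gap for row in matrix)

    start = 0
    while start < n and all_gaps(start):
        start += 1
    end = n
    while end > start and all_gaps(end - 1):
        end -= 1
    return [row[start:end] for row in matrix]
```

For instance, under these assumptions complement("AATCA") gives "TTAGT", and applying reverse to that result gives the reverse complement.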
The view of the matrix can be adjusted in various ways. Cells
can be colored according to the state at the site (Color
Cells submenu, Character State) or according
to a value like the GC bias (Color Cells submenu, Cell
Value; you can request that this coloring use a moving window).
Examples of this are shown below. The Display submenu of the Matrix
menu contains other options, such as a Bird's eye view
that narrows the cells to show more of the sequences.
Copy Sequence (at bottom of Edit menu) copies
the selected cells of the matrix into the computer's clipboard
as a sequence. That is, whereas the standard Copy would place
into the clipboard selected pieces of the matrix in tab-delimited
text format (e.g., if the sequence AATCA is selected, "A-tab-A-tab-T-tab-C-tab-A"
would be copied), this modified Copy Sequence command does not
include tabs (thus, "AATCA" would be copied). This style
of copying is useful when interacting with programs like Sequencher
(TM). For instance, if you want to find a piece of sequence
in a matrix in Mesquite within a chromatogram viewer of Sequencher,
do the following: select the sequence in Mesquite, choose Copy
Sequence, then go to Sequencher, select Find Bases, and paste
the sequence as the search string.
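The difference between the two clipboard formats can be illustrated with a short Python sketch. The variable names are hypothetical; only the tab-delimited versus undelimited distinction is taken from the description above.

```python
# Sketch only: the two clipboard formats described above.
selected = ["A", "A", "T", "C", "A"]  # states in the selected cells of one row

standard_copy = "\t".join(selected)   # standard Copy: tab-delimited text
copy_sequence = "".join(selected)     # Copy Sequence: no tabs

print(repr(standard_copy))
print(repr(copy_sequence))
```

The second form is the one that can be pasted directly as a search string into another program.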
Pieces of sequences can be found using the Find Sequence and
Find All Sequences submenus of the Edit menu. The current options
are:
- Matching Sequence: This finds sequences
matching a target sequence the user enters. The search allows
a certain number of mismatches. Optionally, it can search for
the reverse, complement and reverse complement of the target
sequence.
- Maintain Target Match:
This highlights the first occurrence of a given sequence in a
given taxon, and keeps it highlighted. First, you are asked which
taxon to search in. A panel is then displayed
underneath the matrix. The first button (red X) closes
the panel; the second pauses the search; the third allows you
to select another taxon as your focus. If you type a sequence
into the text area, the matching sequence (if any) will be highlighted
in the matrix. Mesquite is constantly monitoring this text,
and so you don't need to give any command to find again if you
change the text. This is useful if working with a program like
Sequencher. If you see a stretch of sequence while viewing chromatograms
that you'd like to find in the matrix in Mesquite, type the
sequence into the text box and you will quickly be taken to
it in the taxon.
- Maintain Clipboard Match:
This is similar to Maintain Target Match, except that it obtains
the search string not from the text area but from the clipboard.
If the clipboard changes, the function will automatically find
the sequence again in the matrix. This is useful if working
with a program like Sequencher. If you turn on Maintain Clipboard
Match, then you can copy stretches of a sequence within Sequencher,
and Mesquite will automatically highlight it, without your having
to return to Mesquite or give any other command to it. (Mesquite
constantly monitors the clipboard to see if it changes.)
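The kind of search performed by Matching Sequence can be sketched as follows. This is a minimal illustration, not Mesquite's algorithm: the function names are hypothetical, matching is case-sensitive, gaps are not skipped, and ambiguity codes are compared literally rather than interpreted.

```python
# Sketch only: approximate matching of a target within one sequence.
# Function names are hypothetical; this is not Mesquite's implementation.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_matches(sequence: str, target: str, max_mismatches: int = 0) -> list[int]:
    """Return 0-based start positions where target matches within the allowed mismatches."""
    hits = []
    m = len(target)
    if m == 0:
        return hits
    for start in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[start:start + m], target):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # The inner loop finished without exceeding the mismatch limit.
            hits.append(start)
    return hits


def search_variants(sequence: str, target: str, max_mismatches: int = 0) -> dict[str, list[int]]:
    """Search for the target and its reverse, complement and reverse complement."""
    comp = target.translate(_COMPLEMENT)
    variants = {
        "forward": target,
        "reverse": target[::-1],
        "complement": comp,
        "reverse complement": comp[::-1],
    }
    return {name: find_matches(sequence, v, max_mismatches) for name, v in variants.items()}


result = search_variants("GGAATCAGGTGATTCC", "AATCA")
print(result)
```

A monitoring feature such as Maintain Target Match would simply rerun a search like this whenever the search string changes.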
Simulating DNA sequence evolution
DNA sequence evolution can be simulated to build statistical
tests, for instance via parametric bootstrapping. See the page
on simulating DNA sequences.
Statistics for DNA sequences
Calculations for categorical characters in general can be applied
to DNA sequences. For example, Parsimony
calculations can be made for DNA sequences, as can basic descriptive
statistics such as the percent of a sequence or character that
is missing data or gaps. In addition, there are several modules
specifically designed for DNA data, illustrated by examples in
Mesquite_Folder/examples/Molecular. These calculate compositional
bias:
- ACGT Compositional Bias — This module
supplies the compositional bias of taxa, measured over the taxon's
sequence. The bias is treated as a continuous character, and
thus can be used wherever characters are used, as for instance
in the reconstruction of the evolution of compositional bias
as shown in the image below. It can return either the proportion
G+C, or separately A, C, G, and T proportions.

- Character Compositional Bias — This
module supplies the compositional bias for characters. It calculates
the percent of taxa with particular nucleotides (GC bias, or
individual frequency of A, C, G or T) for a character. The image
below shows a moving window analysis of compositional bias along
a sequence; the instructions for generating the chart are given
here.

- GC bias coloring of matrices — The
cells of the Character Matrix Editor may be colored according
to a moving window of GC bias along the sequence, as shown below.
Select Matrix>Color Cells>Color By Cell Value; once the colors are
shown, they can be smoothed by a moving window analysis by selecting
Matrix>Moving Window (for colors).
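
The compositional bias calculations described in this section can be sketched in Python. The sketch makes assumptions that the text does not settle: only unambiguous A, C, G and T are counted, gaps and ambiguity codes are ignored, and the moving window is centered and truncated at the ends of the sequence. The function names and the example matrix are hypothetical.

```python
# Sketch only: G+C proportion per taxon, per character, and a moving window.
# Treatment of gaps, ambiguity codes and window edges is an assumption.

def gc_proportion(states: str):
    """Proportion of G+C among unambiguous A/C/G/T states; None if there are none."""
    bases = [s for s in states.upper() if s in "ACGT"]
    if not bases:
        return None
    return sum(b in "GC" for b in bases) / len(bases)


def taxon_gc(matrix: dict[str, str]) -> dict:
    """G+C proportion measured over each taxon's whole sequence."""
    return {taxon: gc_proportion(seq) for taxon, seq in matrix.items()}


def character_gc(matrix: dict[str, str]) -> list:
    """G+C proportion across taxa for each character (column)."""
    return [gc_proportion("".join(column)) for column in zip(*matrix.values())]


def moving_window(values: list, size: int) -> list:
    """Mean over a centered window, skipping None values."""
    half = size // 2
    smoothed = []
    for i in range(len(values)):
        window = [v for v in values[max(0, i - half): i + half + 1] if v is not None]
        smoothed.append(sum(window) / len(window) if window else None)
    return smoothed


matrix = {"taxonA": "GCGC-A", "taxonB": "ATGCNT"}  # hypothetical aligned data
per_taxon = taxon_gc(matrix)
per_character = character_gc(matrix)
smoothed = moving_window(per_character, 3)
```

The per-taxon values correspond to a continuous character for the taxa, while the per-character values are what a moving window along the sequence would smooth.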

Statistics for protein data
- Site hydrophobicity — This module supplies
the average amino acid hydrophobicity, averaged across taxa,
for each site. It can be used in charts, for instance to see
the relationship between a phylogenetic statistic for the site
(character) and its average hydrophobicity. This chart,
for example, shows parsimony character steps as a function of
hydrophobicity:

- Amino Acid hydrophobicity — The cells
of the Character Matrix Editor may be colored according to a
moving window of hydrophobicity along the sequence, as shown
below. Select Matrix>Color Cells>Color By Cell Value; once the
colors are shown, they can be smoothed by a moving window analysis
by selecting Matrix>Moving Window (for colors).
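
A per-site hydrophobicity average of the kind described here can be sketched in Python. The text does not say which hydrophobicity scale Mesquite uses; the Kyte-Doolittle scale below is an assumption chosen for illustration, as are the function name and the decision to skip gaps and non-standard symbols.

```python
# Sketch only: mean hydrophobicity per site, averaged across taxa.
# The Kyte-Doolittle scale is an assumed stand-in, not necessarily Mesquite's.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def site_hydrophobicity(matrix: list[str]) -> list:
    """Average hydrophobicity of each column; None where no residue is scored."""
    result = []
    for column in zip(*matrix):
        scores = [KYTE_DOOLITTLE[aa] for aa in column if aa in KYTE_DOOLITTLE]
        result.append(sum(scores) / len(scores) if scores else None)
    return result


sites = site_hydrophobicity(["AI-", "RL-"])  # hypothetical two-taxon alignment
```

Each value in the result can then be paired with another per-site statistic, such as parsimony steps, to produce a chart like the one described above.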

Visualizing tertiary structure
Although there are not yet dedicated windows for visualizing
phylogenetic statistics in the context of molecular structure,
features have been added to the Scattergram chart to allow it
to be adapted for this purpose. For instance, in this image cytochrome
B is shown, with the amino acids colored according to a simple
phylogenetic statistic: the number of parsimony steps on a phylogeny.
The colors are smoothed by a moving window, and show that several
coils of the molecule, a few at the left and one deep at the right,
evolve more rapidly than others. This example is illustrated in
the data file at Mesquite_Folder/examples/Molecular/06-cytochromeB.nex

To build such a chart, begin with a file with a matrix of protein
sequences. The procedure is also described in the example files
08-cytochromeBlinked.nex and 09-cytochromeBscatter.nex.
- Select New Linked Matrix from the Characters
menu. When a matrix is made to be linked to a second matrix,
the two matrices are constrained to have the same number of
characters.
- Indicate that you want the linked matrix to be a Continuous
matrix, and link it to your protein matrix. Then, turn it into
a three dimensional matrix (Taxa X Characters X Coordinates
[x, y and z]) by using Add Item and Rename
Item in the Utilities submenu of the Matrix menu of
the Character Matrix Editor. The x,y,z coordinates could be
added for all taxa if known, but otherwise only one taxon needs
to be filled out (because we will use the average x,y,z coordinates
for the amino acids).
- Once the linked matrix of xyz amino acid positions is entered,
select Analysis>New Scattergram for>Characters. Indicate you want
the scattergram to be for Stored Characters,
and indicate Same value for the two axes. In
the dialog box "Values for axes", choose Mean
Value of Character (Linked Matrix). In response to
"Use characters from which matrix? (for Character Source)"
choose the protein sequence matrix as the matrix to be used.
This will plot the sites (amino acids, characters) in their
correct places, but as a series of round spots.
- To change the appearance of the plot, select Join
the Dots in the Special Effects submenu of the Scattergram
menu. Then select Thick Joints, deselect Show
Dots, deselect Join First to Last,
and set the marker size larger (e.g., 8). This will result in
a plot as shown above, but without the colors.
- Next, choose Color by Third Value from the
Colors menu and choose the value by which to color the amino
acids. For parsimony steps, for instance, choose Character
Value with current tree.
- Finally, to use a moving window to smooth the colors, select
Moving Window for Colors from the Colors menu
and indicate the window size (e.g., 5).
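The data preparation behind this procedure can be sketched in Python: averaging the x, y, z coordinates of each site across the taxa for which they were entered, then joining consecutive sites without closing the loop. The data layout and function names are hypothetical and do not reflect how Mesquite stores linked matrices.

```python
# Sketch only: mean coordinates per site and the segments that join them.
# coords[taxon][site] is an (x, y, z) tuple, or None where nothing was entered.

def mean_coordinates(coords: list[list]) -> list:
    """Average the entered (x, y, z) values for each site across taxa."""
    means = []
    for site in range(len(coords[0])):
        known = [row[site] for row in coords if row[site] is not None]
        if not known:
            means.append(None)
        else:
            means.append(tuple(sum(axis) / len(known) for axis in zip(*known)))
    return means


def backbone_segments(points: list) -> list:
    """Join each site to the next one (first and last are not joined)."""
    return [
        (a, b)
        for a, b in zip(points, points[1:])
        if a is not None and b is not None
    ]


coords = [
    [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), None],  # taxon with two sites entered
    [None, (3.0, 2.0, 1.0), None],             # taxon with one site entered
]
means = mean_coordinates(coords)
segments = backbone_segments(means)
```

Because the means ignore missing entries, filling in the coordinates for a single taxon is enough to position every site.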
Sequence data within populations
See the page on population genetics.
Reconstructing ancestral states
Ancestral states of continuous characters can be reconstructed
as described in the page on reconstructing
ancestral states. Likelihood methods are not yet available
for molecular characters.