Input options
Any of the following modes also accepts --labels, which gives the categories
to colour points by in the output plots. This should be a tab-separated file
with no header, with sample names in the first column and the label for each
sample in the second column.
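As a minimal sketch of the file layout described above, the following Python writes a two-column, tab-separated labels file with no header. The sample names, categories, and the file name labels.tsv are made up for illustration.

```python
def format_labels(labels):
    """Return the text of a labels file: one 'sample<TAB>label' line per sample, no header."""
    return "".join(f"{sample}\t{label}\n" for sample, label in labels.items())


def write_labels(path, labels):
    """Write the labels text to the given path."""
    with open(path, "w") as handle:
        handle.write(format_labels(labels))


# Hypothetical sample names and categories, for illustration only
example_labels = {
    "sample_1": "lineage_A",
    "sample_2": "lineage_B",
    "sample_3": "lineage_A",
}

if __name__ == "__main__":
    # The file name is an assumption; pass whatever path you give to --labels
    write_labels("labels.tsv", example_labels)
```

The sample names in the first column need to match the names used in the distance input.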
There are two additional parameters for the distances:
--threshold THRESHOLD
Maximum distance to consider [default = None]
--kNN KNN
Number of k nearest neighbours to keep when sparsifying the distance matrix.
Rather than using a full (dense) matrix of all pairwise distances, mandrake uses a sparse matrix, ignoring large distances. This uses significantly less memory without affecting results.
--kNN sets the number of distances to keep for each sample, which will be the
\(k\) closest. Set \(k\) to a number smaller than the number of samples.
Memory use grows linearly with \(k\). Setting \(k\) too small will miss global
structure in the data.
--threshold instead picks a maximum distance that is considered meaningful;
larger distances are removed from the input.
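To make the two sparsification strategies concrete, here is a small pure-Python sketch on a toy distance matrix. It is not mandrake's implementation: the function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def sparsify_knn(dist, k):
    """For each sample i, keep the k smallest distances to other samples.

    Returns a list of (i, j, distance) entries, so the output size is n * k.
    """
    entries = []
    for i, row in enumerate(dist):
        # Sort this sample's distances to all other samples, closest first
        neighbours = sorted((d, j) for j, d in enumerate(row) if j != i)
        for d, j in neighbours[:k]:
            entries.append((i, j, d))
    return entries


def sparsify_threshold(dist, threshold):
    """Keep only off-diagonal distances no larger than the threshold.

    Assumption: the comparison is inclusive (d <= threshold).
    """
    return [
        (i, j, d)
        for i, row in enumerate(dist)
        for j, d in enumerate(row)
        if j != i and d <= threshold
    ]


# Toy symmetric distance matrix for four samples
dist = [
    [0.0, 0.1, 0.4, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]

knn_entries = sparsify_knn(dist, k=2)
thr_entries = sparsify_threshold(dist, threshold=0.35)
```

With \(k = 2\) each of the four samples keeps two entries, which illustrates why memory grows linearly with \(k\). The threshold version keeps a variable number of entries per sample, depending on how many neighbours fall within the cutoff.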
Multiple sequence alignment
Provide a multi-fasta alignment with --alignment. Distances will be calculated
using a modified form of the pairsnp algorithm, and sparsified based on --kNN
or --threshold.
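For intuition about what a SNP distance between two aligned sequences is, the sketch below counts mismatched positions in a toy alignment. It is a naive illustration, not the pairsnp algorithm or mandrake's modified version: skipping positions with gaps or ambiguous bases is an assumption, and the function names are hypothetical.

```python
def read_fasta(text):
    """Parse multi-FASTA text into a dict of sequence name -> sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in seqs.items()}


def snp_distance(a, b):
    """Count positions where both sequences have A, C, G or T and they differ.

    Assumption: gaps and ambiguous bases (e.g. '-' or 'N') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences in an alignment must have equal length")
    valid = set("ACGT")
    return sum(1 for x, y in zip(a, b) if x in valid and y in valid and x != y)


# Toy alignment with made-up sequences
alignment = """>s1
ACGTACGT
>s2
ACGTACGA
>s3
AC-TNCGA
"""

seqs = read_fasta(alignment)
d12 = snp_distance(seqs["s1"], seqs["s2"])
d13 = snp_distance(seqs["s1"], seqs["s3"])
d23 = snp_distance(seqs["s2"], seqs["s3"])
```

Comparing every pair this way scales quadratically with the number of sequences, which is why sparsifying the result matters for large alignments.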
If you are trying to align large numbers of sequences (e.g. SARS-CoV-2), the reference-guided mode of MAFFT may be helpful:
mafft --6merpair --thread -1 --keeplength --addfragments filtered_SC2.fasta \
nCoV-2019.reference.fasta > MA_filt_SC2.fasta
Sketch database (assemblies or reads)
Provide a pp-sketchlib database with --sketches to calculate core and accessory
distances between the sketches. Core distances are used by default; add
--use-accessory to use accessory distances instead.
This should be an HDF5 file with suffix .h5 produced by sketchlib, for example
by a command such as:
sketchlib sketch -l sample_rfile.txt -o sketch_db --cpus 16
Gene presence/absence
Pan-genome programs such as roary and panaroo output a
gene_presence_absence.Rtab file, which can be used with --accessory
to calculate accessory distances (Hamming distances).
Unitig-counting programs such as unitig-caller also output this file format, which can be used as input in the same way, though the interpretation of the distances is slightly different.
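The Hamming distance between two samples is the number of genes present in one but not the other. The sketch below computes it from a toy presence/absence table, assuming the usual Rtab layout: a header row of sample names, then one row per gene with 0/1 values. It returns raw counts; whether mandrake normalises these distances is not stated here, and the function names are hypothetical.

```python
def read_rtab(text):
    """Parse tab-separated presence/absence text into sample -> list of 0/1 values."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first column holds gene names, so sample names start at column 2
    samples = lines[0].split("\t")[1:]
    profiles = {sample: [] for sample in samples}
    for line in lines[1:]:
        fields = line.split("\t")
        for sample, value in zip(samples, fields[1:]):
            profiles[sample].append(int(value))
    return profiles


def hamming(a, b):
    """Count positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Toy table with made-up gene and sample names
rtab = (
    "Gene\ts1\ts2\ts3\n"
    "geneA\t1\t1\t0\n"
    "geneB\t1\t0\t0\n"
    "geneC\t0\t1\t1\n"
)

profiles = read_rtab(rtab)
h12 = hamming(profiles["s1"], profiles["s2"])
h13 = hamming(profiles["s1"], profiles["s3"])
h23 = hamming(profiles["s2"], profiles["s3"])
```

For unitig input the rows are unitigs rather than genes, so the same count measures differences in sequence content rather than gene content.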
Precalculated distances
After calculating distances, mandrake will save these as <output_prefix>.npz.
This file can be passed back with --distances to avoid computing the distances
again, which is useful when you wish to run the embedding algorithm on the same
data with different parameters.