Input options

All of the following modes also accept --labels, which gives the categories used to colour points in the output plots. This should be a tab-separated file with no header, with sample names in the first column and the label for each sample in the second column.
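As an illustration of the layout described above, the following minimal sketch writes a two-column, tab-separated file with no header. The sample names, label values and the file name labels.tsv are made up for the example.

```python
import csv

# Hypothetical sample names and categories, for illustration only
labels = {
    "sample_1": "lineage_A",
    "sample_2": "lineage_B",
    "sample_3": "lineage_A",
}


def write_labels(path, labels):
    """Write one 'sample<TAB>label' row per sample, with no header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample, label in labels.items():
            writer.writerow([sample, label])


# Assumed output file name; any path can be given to --labels
write_labels("labels.tsv", labels)
```

The resulting file can then be passed to --labels.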

There are two additional parameters for the distances:

--threshold THRESHOLD
                      Maximum distance to consider [default = None]
--kNN KNN             Number of k nearest neighbours to keep when sparsifying the distance
                      matrix.

Rather than using a full (dense) matrix of all pairwise distances, mandrake uses a sparse matrix, ignoring large distances. This uses significantly less memory, generally without affecting results.

--kNN sets the number of distances to keep for each sample, which will be the \(k\) closest. Set \(k\) to a number smaller than the number of samples. Memory use grows linearly with \(k\). Setting \(k\) too small will miss global structure in the data.

--threshold instead picks a maximum distance that is considered meaningful, and larger distances will be removed from the input.
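The two sparsification strategies can be sketched in a few lines of plain Python. This is not mandrake's implementation, only an illustration of the idea: the function name sparsify, the toy distance matrix, and the choice to keep distances equal to the threshold are all assumptions made for the example.

```python
def sparsify(dist, k=None, threshold=None):
    """Return the (i, j, d) entries kept from a dense distance matrix.

    dist is a square list of lists. k is the number of nearest neighbours
    kept per sample, threshold is the largest distance kept. Either may be
    None, in which case that filter is not applied.
    """
    kept = []
    for i, row in enumerate(dist):
        # Candidate neighbours of sample i, excluding the self-distance
        neighbours = [(d, j) for j, d in enumerate(row) if j != i]
        if threshold is not None:
            # Assumption: distances equal to the threshold are kept
            neighbours = [(d, j) for d, j in neighbours if d <= threshold]
        neighbours.sort()  # closest first
        if k is not None:
            neighbours = neighbours[:k]
        kept.extend((i, j, d) for d, j in neighbours)
    return kept


# Toy symmetric distance matrix for four samples
dist = [
    [0.0, 0.1, 0.4, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]

knn_entries = sparsify(dist, k=2)             # 4 samples x 2 neighbours = 8 entries
thr_entries = sparsify(dist, threshold=0.35)  # only distances <= 0.35 are kept
```

With k nearest neighbours the number of stored entries is the number of samples multiplied by k, which is why memory use grows linearly with \(k\). With a threshold, the number of entries depends on the data.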

Multiple sequence alignment

Provide a multi-fasta alignment with --alignment. Distances will be calculated using a modified form of the pairsnp algorithm, and sparsified based on --kNN or --threshold.

If you are trying to align large numbers of sequences (e.g. SARS-CoV-2), the reference-guided mode of MAFFT may be helpful:

mafft --6merpair --thread -1 --keeplength --addfragments filtered_SC2.fasta \
nCoV-2019.reference.fasta > MA_filt_SC2.fasta

Sketch database (assemblies or reads)

Provide a pp-sketchlib database with --sketches to calculate core and accessory distances between the sketches. Core distances are used by default; add --use-accessory to use accessory distances instead.

This should be an HDF5 file with suffix .h5 produced by sketchlib, for example with a command such as:

sketchlib sketch -l sample_rfile.txt -o sketch_db --cpus 16

Gene presence/absence

Pan-genome programs such as roary and panaroo output a gene_presence_absence.Rtab file, which can be used with --accessory to calculate accessory distances (Hamming distances).

Unitig counting programs such as unitig-caller also output this file format. It can be used as input in the same way, though the interpretation of the distances is slightly different.
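To make the Hamming distance concrete, the sketch below reads a small presence/absence table and counts, for each pair of samples, the rows at which they differ. The table contents are invented, the layout (one row per gene, one column per sample, with a header row) is assumed to match the usual Rtab format, and the helper names read_rtab and hamming are hypothetical. Any normalisation mandrake applies to these counts is not shown.

```python
import io
from itertools import combinations

# Small made-up table in the assumed Rtab layout:
# header row of sample names, then one row of 0/1 values per gene
rtab_text = (
    "Gene\tsample_1\tsample_2\tsample_3\n"
    "geneA\t1\t1\t0\n"
    "geneB\t1\t0\t0\n"
    "geneC\t0\t1\t1\n"
    "geneD\t1\t1\t1\n"
)


def read_rtab(handle):
    """Return {sample: [0/1 per gene]} from a tab-separated table."""
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[1:]  # first column holds the gene names
    profiles = {sample: [] for sample in samples}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines
        for sample, value in zip(samples, fields[1:]):
            profiles[sample].append(int(value))
    return profiles


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))


profiles = read_rtab(io.StringIO(rtab_text))
distances = {
    (s, t): hamming(profiles[s], profiles[t])
    for s, t in combinations(profiles, 2)
}
# distances[("sample_1", "sample_2")] is 2: the samples differ at geneB and geneC
```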

Precalculated distances

After calculating distances, mandrake saves them as <output_prefix>.npz. Provide this file with --distances to use them as input without computing them again, which is useful when you wish to run the embedding algorithm on the same data with different parameters.