mandrake’s documentation

mandrake (stochastic cluster embedding with genetic distances)

mandrake is a tool for creating visualisations of pathogen populations from their genome data. The visualisation produces are optimised to produce clusters of similar sequences, represented in a two dimensional embedding.

You may wish to use this tool to:

Get a quick look at the structure of your population, and identify possible clusters.
See if these clusters match with known labels.
Determine whether supervised learning is likely to work on this input data.
Make pretty pictures and animations.

mandrake is primarily a visualisation tool. To determine clusters robustly, we would recommend a model-based method such as fastbaps or poppunk.

To understand local embeddings better, we would recommend the following excellent guide: https://distill.pub/2016/misread-tsne/.

It can take as input:

Assembly or read data (using sketchlib).
A multiple sequence alignment.
A gene presence/absence matrix.

Runs the following steps:

Distance calculation, and sparsification to \(k\) nearest neighbours, or using a threshold.
Conversion of distances to conditional probabilities at the specified perplexity.
A modified version of stochastic cluster embedding.
HDBSCAN on the embedding, or labelling with provided categories.
Plots of the output.

Producing the following output:

A numpy version of the sparse matrix, for reuse.
A text version of the output embedding.
An interactive HTML file with the embedding, and hover labels.
A static version of this embedding.
A hexbin plot to show density of the embedding (which is usually overplotted).
(optionally) A video of the embedding process as the algorithm runs.

mandrake is very fast, and can be used on millions of input samples.

Contents: