Histosketching Using Little Kmers
UPDATE: JULY 2019
I no longer work for STFC. All versions of HULK pre 1.0.0 have been renamed and archived to the STFC GitHub. The STFC Hartree Centre are building genomic solutions based on these and other tools; if you are interested, please contact them (hartree@stfc.ac.uk).
This repo now hosts HULK >= version 1.0.0, which is a complete re-implementation of HULK and based solely off the method described in the open-access paper.
I’ve tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It’s a work in progress, but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses minimizer frequencies to represent the underlying microbiome sample.
Importantly, this project is now fully open source!
HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK approximates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.
HULK works by collecting minimizers from sequences. Minimizers are assigned to a finite number of histogram bins using a consistent jump hash; these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by HULK. Similarly to MinHash sketches, histosketches can be used to estimate similarity between sequence data sets.
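The binning step described above can be illustrated with a minimal Go sketch. It is not HULK's actual code: the function names (`hashKmer`, `minimizers`, `addToBins`), the use of FNV-1a as the k-mer hash, the window size and the bin count are all assumptions made for illustration, and canonical k-mers (reverse complements) are ignored. Only `jumpHash` follows a published algorithm, the jump consistent hash of Lamping and Veach. The histosketching step itself is left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jumpHash is the jump consistent hash (Lamping and Veach): it maps a
// 64-bit key to a bucket in the range [0, numBuckets).
func jumpHash(key uint64, numBuckets int) int {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

// hashKmer hashes a k-mer with FNV-1a (an assumption; any 64-bit hash works).
func hashKmer(kmer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	return h.Sum64()
}

// minimizers returns, for every window of w consecutive k-mers, the hash of
// the k-mer with the smallest hash value. A minimizer shared by overlapping
// windows (same position) is reported once.
func minimizers(seq string, k, w int) []uint64 {
	n := len(seq) - k + 1 // number of k-mers in seq
	if k <= 0 || w <= 0 || n < w {
		return nil
	}
	hashes := make([]uint64, n)
	for i := range hashes {
		hashes[i] = hashKmer(seq[i : i+k])
	}
	var out []uint64
	lastPos := -1
	for start := 0; start+w <= n; start++ {
		minPos := start
		for i := start + 1; i < start+w; i++ {
			if hashes[i] < hashes[minPos] {
				minPos = i
			}
		}
		if minPos != lastPos {
			out = append(out, hashes[minPos])
			lastPos = minPos
		}
	}
	return out
}

// addToBins increments one histogram bin per minimizer found in seq.
func addToBins(bins []uint32, seq string, k, w int) {
	for _, m := range minimizers(seq, k, w) {
		bins[jumpHash(m, len(bins))]++
	}
}

func main() {
	bins := make([]uint32, 16) // hypothetical number of histogram bins
	addToBins(bins, "ACGTACGTTGCAACGTAGCTAGCTAGGCTA", 7, 4)
	fmt.Println(bins)
}
```

In a streaming setting, `addToBins` would be called once per read, and the bin counts would be passed to the histosketching step after every X reads.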
The advantages of HULK include:
… hulk smash into the command line…
Finally, you can use hulk sketches with a machine learning classifier to predict microbiome sample origin (see the paper and BANNER).
sketch subcommand:
smash subcommand:
print and distance subcommands: … is available in the smash subcommand
Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.
For versions <1.0.0, use Bioconda. I will add the recipe for HULK 1.0.0 as soon as possible.
conda install -c bioconda hulk
HULK is written in Go (v1.12). To compile from source you will first need the Go toolchain. Once you have it, try something like this to compile:
# Clone this repository
git clone https://github.com/will-rowe/hulk.git
# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...
# Run the unit tests
go test -v ./...
# Compile the program
go build ./
# Call the program
./hulk --help
HULK is called by typing hulk, followed by the subcommand you wish to run. The main subcommands are sketch and smash:
# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA
# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
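For readers unfamiliar with the metric named in the smash example, here is a minimal Go sketch of the textbook weighted Jaccard similarity between two count vectors: the sum of element-wise minima divided by the sum of element-wise maxima. The function name `weightedJaccard` and the example vectors are made up for illustration, and this is not necessarily how hulk smash computes the value from histosketches internally.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// weightedJaccard returns sum(min(a[i], b[i])) / sum(max(a[i], b[i])) for two
// equal-length vectors of non-negative weights.
func weightedJaccard(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("vectors must have the same length")
	}
	var minSum, maxSum float64
	for i := range a {
		if a[i] < 0 || b[i] < 0 {
			return 0, errors.New("weights must be non-negative")
		}
		minSum += math.Min(a[i], b[i])
		maxSum += math.Max(a[i], b[i])
	}
	if maxSum == 0 {
		// Both vectors are all zeros; treating them as identical is a
		// convention chosen for this sketch.
		return 1, nil
	}
	return minSum / maxSum, nil
}

func main() {
	sampleA := []float64{2, 0, 3, 1} // hypothetical bin counts
	sampleB := []float64{1, 1, 3, 0}
	wj, err := weightedJaccard(sampleA, sampleB)
	if err != nil {
		panic(err)
	}
	// Minima sum to 4 and maxima sum to 7, so this prints 0.571.
	fmt.Printf("%.3f\n", wj)
}
```

A value of 1 means the two vectors are identical and 0 means they share no weight in any bin; a pairwise matrix is built by applying the function to every pair of samples.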
I’m working on some new documentation and this will be available on readthedocs soon.
A paper describing the HULK method is published in Microbiome:
Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.