Binary representation of fastq files
NOTE: THIS IS A WORK IN PROGRESS. SEE SECTION “Currently implemented” FOR
CURRENTLY IMPLEMENTED FUNCTIONALITIES.
Fastqube is a little tool that makes a binary representation of fastq file(s).
Fastq files are typically very inefficient text files, and are as such
usually directly compressed with gzip.
Many tools exist that transform fastq into a more compressible format,
but those typically rely on aligning to a reference genome first (e.g. CRAM).
Instead, fastqube is a simple direct binary representation of a given fastq
file. As such, each read consists of three fields:
Fastqube files also contain a 4096-byte header, which contains some metadata
about the program and binary mode used, and reserves space for potential future
metadata fields.
Apart from the lossless mode described above, fastqube also supports three
different lossy modes through various settable parameters. These parameters
can be combined.
Read IDs are fixed-width, but the width is settable. With a read ID size of 0,
read IDs are not emitted to the compressed stream at all.
When serializing fastqube files back to fastq files, read IDs are generated as
a simple integer, optionally with a pair tag. I.e. the n-th read will have an
ID of n.
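As a rough illustration of the generated IDs, here is a minimal Python sketch. The function name `generated_ids`, counting from 1, and the `/1` and `/2` suffix used as the pair tag are all assumptions made for the example; the actual tag format fastqube writes may differ.

```python
from typing import Iterator, Optional


def generated_ids(n_reads: int, pair: Optional[int] = None) -> Iterator[str]:
    """Yield the integer IDs 1..n_reads, optionally followed by a pair tag."""
    # Hypothetical pair-tag format: "/1" or "/2" appended to the integer.
    suffix = "" if pair is None else f"/{pair}"
    # Assumption: the first read is numbered 1.
    for n in range(1, n_reads + 1):
        yield f"{n}{suffix}"
```

For example, `list(generated_ids(2, pair=1))` yields the IDs for the first two reads of an R1 file under these assumptions.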
With block quality encoding enabled, quality scores are stored in 3-bit
representation, with only five possible values: 0, 2, 26, 31 and 41. Any other
scores will be rounded down to the nearest possible value.
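The rounding rule can be sketched in a few lines of Python. The five levels come from the description above; the helper names and the idea of using a level's index as its 3-bit code are assumptions for illustration, not necessarily what fastqube stores.

```python
from bisect import bisect_right

# Allowed quality values, as listed in the text above.
BLOCK_LEVELS = (0, 2, 26, 31, 41)


def block_quality(score: int) -> int:
    """Round a Phred score down to the nearest allowed level."""
    if score < 0:
        raise ValueError("quality scores must be non-negative")
    # bisect_right counts the levels <= score; the last of those is the floor.
    return BLOCK_LEVELS[bisect_right(BLOCK_LEVELS, score) - 1]


def block_code(score: int) -> int:
    """Hypothetical 3-bit code: the index of the rounded level."""
    return BLOCK_LEVELS.index(block_quality(score))
```

Under this sketch a score of 40 becomes 31, and anything above 41 is capped at 41.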
With 2-bit sequence encoding enabled, the sequence is stored in a 2-bit
representation. N, and all other IUPAC symbols, will be squashed to a G.
G is also the base the NovaSeq uses as its ‘black’ color, hence we hope that
choosing this base as the fallback mimics the NovaSeq’s behavior.
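A minimal Python sketch of such a 2-bit packing follows. Only the fallback to G is taken from the description above; the code assignment (A=0, C=1, G=2, T=3), the bit order within a byte, and the function names are assumptions and may not match the real file format.

```python
# Hypothetical 2-bit codes; the real fastqube assignment may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
FALLBACK = CODES["G"]  # N and all other symbols are squashed to G


def pack_2bit(seq: str) -> bytes:
    """Pack a sequence at four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        code = CODES.get(base, FALLBACK)
        # Assumption: the first base of each group sits in the two highest bits.
        out[i // 4] |= code << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover `length` bases from a buffer produced by pack_2bit."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(length)
    )
```

Because the squashing is lossy, a round trip of `ACGTN` comes back as `ACGTG`. The read length has to be known when unpacking, since the last byte may contain padding bits.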
A typical fastq read with a 64-byte ID and 150 bases consists of 364 bytes.
The direct lossless binary representation of such a read would consist of
about 234 bytes.
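To get a feel for these numbers, the short Python sketch below computes how many bytes a fixed-width field needs. It assumes that the 364-byte figure counts only the ID, base and quality characters (64 + 150 + 150), and that each packed field is padded to a whole byte; both are assumptions, as is the helper name `packed_bytes`.

```python
def packed_bytes(n_symbols: int, bits_per_symbol: int) -> int:
    """Bytes needed for n_symbols at a fixed bit width, rounded up to a whole byte."""
    return (n_symbols * bits_per_symbol + 7) // 8


ID_BYTES = 64
READ_LENGTH = 150

# Plain-text size, assuming one byte per ID, base and quality character.
text_size = ID_BYTES + READ_LENGTH + READ_LENGTH

# Field sizes for the 3-bit quality and 2-bit sequence widths named in this document.
quality_3bit = packed_bytes(READ_LENGTH, 3)
sequence_2bit = packed_bytes(READ_LENGTH, 2)
```

Under these assumptions, 150 quality scores at 3 bits take 57 bytes and 150 bases at 2 bits take 38 bytes, compared with 150 bytes each in the text format.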
With the lossy modes the size is further trimmed to:
Block Quality Mode
Block Quality Mode and a read ID size of 0
Block Quality Mode and a read ID size of 0, and …
Single-end modes optionally read from stdin.
fastqube -c input.fastq > output.fqb
fastqube -c -R1 R1.fastq -R2 R2.fastq -o1 R1.fqb -o2 R2.fqb
fastqube -d input.fqb > output.fastq
fastqube -d -R1 R1.fqb -R2 R2.fqb -o1 R1.fastq -o2 R2.fastq
Lossy compression is enabled with several parameters:
fastqube -B 0 -c input.fastq > output.fqb
fastqube -2 -c input.fastq > output.fqb
fastqube -b -c input.fastq > output.fqb
fastqube -b -B 0 -2 -c input.fastq > output.fqb
Lossy decompression uses the same command line as lossless decompression:
the fastqube header records which lossy mode was used.
There is a little prototype implemented in Python. So far it
only supports compression in lossless mode. You can use this prototype
to get an impression of what fastqube-generated files will ultimately look like.
As this is a prototype, it is really slow, and even the file format is liable
to change.
BSD-3-clause