Project author: guilledufort

Project description:
A FASTQ lossless compression algorithm especially designed for nanopore sequencing FASTQ files.
Language: C++
Project URL: git://github.com/guilledufort/EnanoFASTQ.git
Created: 2020-01-30T20:31:07Z
Project community: https://github.com/guilledufort/EnanoFASTQ

License: MIT License

ENANO FASTQ

An encoder for nanopore FASTQ files

Publication: https://doi.org/10.1093/bioinformatics/btaa551

Description

ENANO is a FASTQ lossless compression algorithm especially designed for nanopore sequencing FASTQ files. We tested ENANO and current state-of-the-art compressors on several publicly available nanopore datasets. The results show that our algorithm consistently achieves the best compression performance on every nanopore dataset, while being computationally efficient in terms of speed and memory requirements when compared to existing alternatives.

Install with Conda

To install directly from source, follow the instructions in the next section.

Enano is available on conda via the bioconda channel. See this page for installation instructions for conda. Once conda is installed, do the following to install enano.

  1. conda config --add channels defaults
  2. conda config --add channels bioconda
  3. conda config --add channels conda-forge
  4. conda install enano

Note that if enano is installed this way, it should be invoked with the command enano rather than ./enano. The bioconda help page shows the commands if you wish to install enano in an environment.

Install from source code

Download repository

  1. git clone https://github.com/guilledufort/EnanoFASTQ.git

Requirements

  1. g++ ( >= 4.8.1)
  2. OpenMP library

Install

The following instructions will create the enano executable in the directory enano.
To compile enano you need to have the g++ compiler and the OpenMP library for multithreading.

On Linux, g++ usually comes installed by default. If it is missing, run the following on Ubuntu:

  1. sudo apt-get update
  2. sudo apt-get install g++

On CentOS, which uses yum rather than apt, run:

  1. sudo yum install gcc-c++

On macOS, install the GCC compiler, since Clang has issues with the OpenMP library:

  • Install HomeBrew (https://brew.sh/)
  • Install GCC (this step will be faster if Xcode command line tools are already installed using xcode-select --install):
    1. brew update
    2. brew install gcc@9

The g++ installer also installs the OpenMP library, so no further steps are needed.
To check if the g++ compiler is properly installed in your system run:

On Linux:

  1. g++ --version

On macOS:

  1. g++-9 --version

The output should show the name and version of the installed compiler.
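
The minimum version in the Requirements list (g++ >= 4.8.1) can also be checked from a script. The following is a minimal sketch, not part of ENANO: the helper names `parse_version`, `version_at_least` and `installed_gxx_version` are hypothetical, and it assumes that the compiler's `-dumpversion` option prints a purely numeric, dot-separated version (some builds print only the major number, which the sketch pads with zeros).

```python
import subprocess

MINIMUM_GXX = (4, 8, 1)  # minimum g++ version from the Requirements list


def parse_version(text):
    """Turn '9.4.0' (or just '9') into a tuple padded to three numbers."""
    parts = tuple(int(p) for p in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def version_at_least(text, minimum=MINIMUM_GXX):
    """Return True if the version string is at least the required minimum."""
    return parse_version(text) >= minimum


def installed_gxx_version(compiler="g++"):
    """Ask the compiler for its version string (assumes numeric output)."""
    result = subprocess.run(
        [compiler, "-dumpversion"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

On macOS with the Homebrew install described above, the call would be `version_at_least(installed_gxx_version("g++-9"))`.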

To compile enano run:

  1. cd EnanoFASTQ/enano
  2. make

USAGE

Run the enano executable /PATH/TO/enano (or just enano if installed with conda) with the options below:

To compress:

  enano [options] [input_file [output_file]]

  -c            Use MAX COMPRESSION MODE. Default is FAST MODE.
  -k <length>   Basecall sequence context length. Default is 7 (max 13).
  -l <length>   Length of the DNA neighborhood sequence used in the quality score context. Default is 6.
  -t <num>      Maximum number of threads the compressor may use. Default is 8.

To decompress:

  enano -d [options] foo.enano foo.fastq

  -t <num>      Maximum number of threads the decompressor may use. Default is 8.
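
To give some intuition for what a context length such as the one set with -k controls, the sketch below estimates how many bits an idealized adaptive model would need for a base sequence when each base is predicted from the k bases before it. This is an illustration of context modeling in general, not ENANO's actual model: the function name `adaptive_cost_bits`, the count-of-one initialization, and the restriction to the alphabet ACGT are all assumptions made for the example.

```python
import math
from collections import defaultdict


def adaptive_cost_bits(seq, k, alphabet="ACGT"):
    """Estimate the bits needed to code seq with an adaptive order-k model.

    Each symbol is predicted from the (up to) k symbols before it. Every
    count starts at 1 and is updated only after the symbol has been costed,
    so a decoder could rebuild the same counts. Illustrative only.
    """
    counts = defaultdict(lambda: {s: 1 for s in alphabet})
    total_bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]        # the preceding context
        table = counts[ctx]
        prob = table[sym] / sum(table.values())
        total_bits += -math.log2(prob)    # ideal code length of this symbol
        table[sym] += 1                   # adapt the model
    return total_bits


if __name__ == "__main__":
    seq = "ACGT" * 50
    for k in (0, 1, 2):
        print(k, round(adaptive_cost_bits(seq, k), 1))
```

In a model of this kind a longer context can make predictions sharper, but the number of possible contexts grows as 4^k, so memory use and the time needed to learn reliable counts grow with it.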

Datasets information

To test our compressor we ran experiments on the following datasets. Full details of the datasets are given in our publication.

| Dataset | Num. of files | Size (GB) | Description | Link |
|---|---|---|---|---|
| sor** | 4 | 124.071 | Sorghum bicolor Tx430 | https://www.nature.com/articles/s41467-018-07271-1#data-availability |
| bra* | 18 | 43.014 | Doubled haploid canola (Brassica napus L.) | https://www.nature.com/articles/s41598-019-45131-0#data-availability |
| lun | 13 | 15.239 | Human lung bacterial metagenomic | https://www.nature.com/articles/s41587-019-0156-5#data-availability |
| joi | 9 | 4.672 | Infected orthopaedic devices metagenomic | https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5094-y |
| vir* | 10 | 4.375 | Direct RNA sequencing (HSV-1) | https://www.nature.com/articles/s41467-019-08734-9#data-availability |
| hs1 | 1 | 249.791 | Human GM12878 Utah/Ceph cell line | https://github.com/nanopore-wgs-consortium/NA12878 |
| hs2^ | 50 | 193.920 | Human GM12878 Utah/Ceph cell line | https://www.nature.com/articles/s41467-019-09637-5#data-availability |
| npd** | 336 | 113.440 | Multiple organisms | https://github.com/guidufort/DualFqz |

*Datasets whose download requires the SRA toolkit.

^We only used the first 50 files of the dataset.

Downloading the datasets

To download a dataset you have to run the download_script.sh of the specific dataset.
For example, to download sor run:

  1. cd EnanoFASTQ
  2. dataset/sor/download_script.sh

The scripts use the command wget to perform the download.
To install wget on macOS run:

  1. brew install wget

To install wget on Ubuntu run:

  1. sudo apt-get install wget

On CentOS run:

  1. sudo yum install wget

Some datasets can only be downloaded with the SRA toolkit (2.9.6-1 release). To install it, either follow the instructions at https://ncbi.github.io/sra-tools/install_config.html and place the toolkit's root folder under the EnanoFASTQ directory, or run one of the install scripts we provide. There is a separate script for each OS, so choose the one that matches yours.
For example, to install the SRA toolkit on macOS you can run:

  1. cd EnanoFASTQ
  2. ./install_SRA_mac.sh

Examples

If installed using conda, use the command enano instead of enano/enano.

Compress using ENANO

To run the compressor with 4 threads on the example file:

  1. cd EnanoFASTQ
  2. enano/enano -k 8 -l 5 -t 4 example/SAMPLE.fastq example/SAMPLE.enano

Decompress using ENANO

To decompress the example compressed file with 8 threads:

  1. cd EnanoFASTQ
  2. enano/enano -d -t 8 example/SAMPLE.enano example/SAMPLE_dec.fastq
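
When enano is driven from a pipeline rather than typed by hand, the two command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not something shipped with ENANO; it emits only the flags documented in the USAGE section (-d, -c, -k, -l, -t), and its name and parameter names are assumptions made for this sketch.

```python
def build_enano_command(input_path, output_path, *, decompress=False,
                        max_compression=False, context_len=None,
                        neighborhood_len=None, threads=None,
                        executable="enano"):
    """Build an argument list for enano from the documented options.

    Hypothetical helper: options left as None are omitted, so enano falls
    back to its own defaults.
    """
    cmd = [executable]
    if decompress:
        cmd.append("-d")
    else:
        # These options are only documented for compression.
        if max_compression:
            cmd.append("-c")
        if context_len is not None:
            if context_len > 13:
                raise ValueError("-k accepts at most 13")
            cmd += ["-k", str(context_len)]
        if neighborhood_len is not None:
            cmd += ["-l", str(neighborhood_len)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd += [str(input_path), str(output_path)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. With a source build, set `executable="enano/enano"` and run from the EnanoFASTQ directory, as in the examples above.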

Check if decoding is successful

If the decompressed file is identical to the original, the following command prints nothing:

  1. cmp example/SAMPLE.fastq example/SAMPLE_dec.fastq
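
Where cmp is not available, or the check needs to run inside a test script, the same byte-for-byte comparison can be sketched in a few lines. The function name `files_identical` and the 1 MiB chunk size are arbitrary choices for this example.

```python
def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True if both files contain exactly the same bytes.

    Reads both files in chunks so that large FASTQ files do not have to
    fit in memory.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False      # content or length differs
            if not chunk_a:
                return True       # both files ended at the same point
```

For the example above, `files_identical("example/SAMPLE.fastq", "example/SAMPLE_dec.fastq")` should return True after a successful round trip.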

Credits

The methods used for encoding the read names, for the model frequency counters, and for parsing the reads are those proposed by James Bonefield in FQZComp, with some modifications. The range coder is derived from the one by Eugene Shelwien.