Nextflow-based BAM-to-FASTQ conversion and FASTQ-sorting workflow.
Convert BAM files back to FASTQ.
We do not recommend Conda for running the workflow. Packages may no longer be available in any channel, which breaks the environment. For reproducible research, please use containers.
Provided you have a working Conda installation, you can run the workflow with
mkdir test_out/
nextflow run main.nf \
-profile local,conda \
-ansi-log \
--input=/path/to/your.bam \
--outputDir=test_out \
--sortFastqs=false
For each BAM file in the comma-separated --input parameter, one directory with FASTQs is created in the outputDir. With the local profile the processing jobs will be executed locally. The conda profile will let Nextflow create a Conda environment from the task-environment.yml file. By default, the Conda environment will be created in the source directory of the workflow (see nextflow.config).
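Since --input takes a comma-separated list, several BAM files can be joined into one value before calling Nextflow. The following is a minimal sketch, not part of the workflow: the helper join_by_comma and the two BAM paths are hypothetical, and the command is only printed, not executed. Paths that themselves contain commas cannot be passed this way.

```sh
# Join all arguments with commas, as expected by the --input parameter.
# join_by_comma is a hypothetical helper, not part of the workflow.
join_by_comma() {
    joined=""
    for path in "$@"; do
        if [ -z "$joined" ]; then
            joined="$path"
        else
            joined="$joined,$path"
        fi
    done
    printf '%s\n' "$joined"
}

# Hypothetical input files; replace them with your own BAM paths.
input=$(join_by_comma /data/sampleA.bam /data/sampleB.bam)

# Print the command instead of running it, so it can be inspected first.
echo nextflow run main.nf -profile local,conda --input="$input" --outputDir=test_out --sortFastqs=false
```

With these two files the workflow would create two FASTQ directories in test_out, one per BAM.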
Depending on the version of the workflow that you want to run, it might not be possible to rebuild the Conda environment. Therefore, to guarantee reproducibility, we create container images of the task environment.
For instance, to run the workflow locally with Docker, you can use e.g.
nextflow run main.nf \
-profile local,docker \
-ansi-log \
--input=test/test1_paired.bam,test/test1_unpaired.bam \
--outputDir=test_out \
--sortFastqs=true
On your cluster, you may not have access to Docker. In this situation you can use Singularity, if it is installed on your cluster. Note that, unfortunately, Nextflow will fail to convert the Docker image into a Singularity image unless Docker is available. But you can build the Singularity image yourself:
Create a Singularity image from the public Docker container
version=1.0.0
repoDir=/path/to/nf-bam2fastq
singularity build \
"$repoDir/cache/singularity/nf-bam2fastq_$version.sif" \
"docker://ghcr.io/dkfz-odcf/nf-bam2fastq:$version"
Note that the location and name of the Singularity image are configured in the nextflow.config.
nextflow run /path/to/nf-bam2fastq/main.nf \
-profile lsf,singularity \
-ansi-log \
--input=test/test1_paired.bam,test/test1_unpaired.bam \
--outputDir=test_out \
--sortFastqs=true
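Before submitting jobs with the singularity profile, it can help to check that the image exists where the build command above wrote it. This is a sketch under assumptions: singularity_image_path is a hypothetical helper that only mirrors the path used in the singularity build example. The authoritative location is whatever nextflow.config configures.

```sh
# Print the image path that the "singularity build" example writes to.
# $1: workflow repository directory, $2: container version
singularity_image_path() {
    printf '%s/cache/singularity/nf-bam2fastq_%s.sif\n' "$1" "$2"
}

# Placeholder repository path, as in the example above.
image=$(singularity_image_path /path/to/nf-bam2fastq 1.0.0)

if [ -f "$image" ]; then
    echo "Found Singularity image: $image"
else
    echo "Missing Singularity image: $image (build it before using the singularity profile)"
fi
```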
bamtofastq.
Please have a look at the project board for further information.
input: Comma-separated list of input BAM-file paths.
outputDir: Output directory.
sortFastqs: Whether to produce FASTQs in a similar order as in the input BAM (false) or sort by name (true). Default: true. Turning sorting on produces multiple sort-jobs.
excludedFlags: Comma-separated list of flags to bamtofastq’s exclude parameter. Default: “secondary,supplementary”. If you have complete, BWA-aligned BAM files, then exactly the reads of the input FASTQ are reproduced. For other aligners you need to check yourself which parameters are optimal.
publishMode: Nextflow’s publish mode. Allowed values are symlink, rellink, link, copy, copyNoFollow, move. Default is rellink, which produces relative links from the publish dir (in the outputDir) to the directories in the work/ directory. This is to support an invocation of Nextflow in a “run” directory in which all files (symlinked input data, output data, logs) are stored together (e.g. with nextflow run --outputDir ./).
sortMemory: Memory used for sorting. Too large values are useless, unless you have enough memory to sort completely in-memory. Default: “100 MB”.
sortThreads: Number of threads used for sorting. Default: 4.
compressIntermediateFastqs: Whether to compress FASTQs produced by bamtofastq when doing subsequent sorting. Default: true. This is only relevant if sortFastqs=true.
compressorThreads: The compressor (pigz) can use multiple threads. Default: 4. If you set this value to zero, then Nextflow does not require any additional CPUs to be present. However, a single thread will still be used by pigz.
In the outputDir the workflow creates a sub-directory for each input BAM file. These are named like the BAM with one of the suffixes _fastqs or _sorted_fastqs added, depending on the value for sortFastqs you selected. Each of these directories contains a set of FASTQ files, whose names follow the pattern
${readGroupName}_${readType}.fastq.gz
The read-group name is the name of the “@RG” attribute the reads in the file were found to be connected to. For reads in your BAM that don’t have a read-group assigned the “default” read-group is used. Consequently, your BAMs should not contain a read-group “default”! The read-type is one of the following:
These files are all always produced, independent of whether your data is actually single-end or paired-end. If no reads of any of these groups are present in the input BAM file, empty compressed files are produced. Note further that these files are produced for each read-group in your input BAM, plus the “default” read-group. If you have a BAM in which none of the reads are assigned to a read-group, then all reads can be found in the “default” read-group.
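The naming scheme and the warning about the “default” read-group can be checked with a few lines of shell. This is only a sketch: the three helper functions are hypothetical, the file name and its read-type label “R1” are made up for illustration, and the splitting assumes that the read-type contains no underscore.

```sh
# Split "<readGroupName>_<readType>.fastq.gz" at the last underscore.
# Assumption: the read-type itself contains no underscore.
fastq_read_group() {
    name=$(basename "$1" .fastq.gz)
    printf '%s\n' "${name%_*}"
}

fastq_read_type() {
    name=$(basename "$1" .fastq.gz)
    printf '%s\n' "${name##*_}"
}

# Succeeds if a SAM header read from stdin declares a read-group with ID "default".
has_default_read_group() {
    awk -F'\t' '$1 == "@RG" { for (i = 2; i <= NF; i++) if ($i == "ID:default") found = 1 }
                END { exit !found }'
}

# Hypothetical file name with a hypothetical read-type label "R1".
file="test_out/some_output_dir/run1_lane2_R1.fastq.gz"
echo "read group: $(fastq_read_group "$file")"
echo "read type:  $(fastq_read_type "$file")"
```

If samtools is installed, the header check could be applied to a real file with samtools view -H your.bam | has_default_read_group; a zero exit status means the BAM already uses the name “default” and would clash with the workflow's fallback read-group.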
Note that Nextflow creates the work/ directory, the .nextflow/ directory, and the .nextflow.log* files in the directory in which it is executed.
Nextflow’s -profile parameter allows setting technical options for executing the workflow. You have already seen some of the profiles and that these can be combined. We conceptually separated the predefined profiles into two types: those concerning the “environment” and those for selecting the “executor”.
The following “environment” profiles that define which environment will be used for executing the jobs are predefined in the nextflow.config:
Currently, there are only two “executor” profiles that define the job execution method. These are local, which executes the jobs locally, and lsf, which requires that bsub is available.
Please refer to the Nextflow documentation for defining other executors. Note that environments and executors cannot be combined arbitrarily. For instance, your LSF administrators may not allow Docker to be executed by normal users.
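Combining one executor profile with one environment profile can be sketched as a small helper that builds the value for -profile. The function profile_arg is hypothetical, and the accepted names are limited to the profiles that appear in the examples of this document (local, lsf, conda, docker, singularity). The nextflow.config remains the authoritative list.

```sh
# Build the value for -profile from one executor and one environment profile.
# Only the profile names used in this document's examples are accepted.
profile_arg() {
    executor=$1
    environment=$2
    case "$executor" in
        local|lsf) ;;
        *) echo "unknown executor profile: $executor" >&2; return 1 ;;
    esac
    case "$environment" in
        conda|docker|singularity) ;;
        *) echo "unknown environment profile: $environment" >&2; return 1 ;;
    esac
    printf '%s,%s\n' "$executor" "$environment"
}

# Print an example invocation rather than running it.
echo "nextflow run main.nf -profile $(profile_arg lsf singularity) ..."
```

The check only guards against typos. It cannot tell whether a combination is actually permitted on your cluster.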
By default, the Conda environments of the jobs as well as the Singularity containers are stored in subdirectories of the cache/ subdirectory of the workflow’s installation directory (called projectDir by Nextflow). For example, to use the Singularity container, you can install it as follows:
cd $workflowRepoDir
# Refer to the nextflow.config for the name of the Singularity image.
singularity build \
cache/singularity/nf-bam2fastq_1.0.0.sif \
docker://ghcr.io/dkfz-odcf/nf-bam2fastq:1.0.0
# Test your container
test/test1.sh test-results/ singularity nextflowEnv/
This is suited for either a user-specific installation or for a centralized installation for which the environments should be shared among all users. Please refer to the nextflow.config or the NXF_*_CACHEDIR environment variables to change this default (see here).
Make sure your users have read and execute permissions on the directories and read permissions on the files in the shared environment directories. Set NXF_CONDA_CACHEDIR to an absolute path to avoid “Not a conda environment: path/to/env/nf-bam2fastq-3e98300235b5aed9f3835e00669fb59f” errors.
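One way to make sure NXF_CONDA_CACHEDIR is absolute is to resolve the directory before exporting the variable. The following is a sketch under assumptions: absolute_dir is a hypothetical helper, and the cache location below the temporary directory is only an example, not a recommended place for a shared cache.

```sh
# Resolve a directory to an absolute path; the directory is created if missing.
# absolute_dir is a hypothetical helper.
absolute_dir() {
    mkdir -p "$1" && (cd "$1" && pwd -P)
}

# Example location only; choose a shared directory that suits your installation.
NXF_CONDA_CACHEDIR=$(absolute_dir "${TMPDIR:-/tmp}/nf-bam2fastq-cache/conda")
export NXF_CONDA_CACHEDIR

# Give all users read and execute permissions on the cache directory itself.
# Parent directories and the files created later need suitable permissions too.
chmod a+rx "$NXF_CONDA_CACHEDIR"

echo "NXF_CONDA_CACHEDIR=$NXF_CONDA_CACHEDIR"
```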
The integration tests can be run with
test/test1.sh test-results/ $profile
This will create a test Conda environment in test-results/nextflowEnv and then run the tests. For the tests themselves you can use a local Conda environment or a Docker container, depending on whether you set $profile to “conda” or “docker”, respectively. These integration tests are also run in Travis CI.
For all commits with a tag that follows the pattern \d+\.\d+\.\d+ the job containers are automatically pushed to the GitHub Container Registry of the “ODCF” organization. Version tags should only be added to commits on the master branch, although currently no automatic rule enforces this.
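To see in advance whether a tag would trigger a container release, the pattern can be tested locally. This is a sketch, not the CI configuration: is_release_tag is a hypothetical helper, \d is written as [0-9] for portability, and it assumes the pattern is meant to match the whole tag rather than a substring.

```sh
# Succeeds if the tag consists of exactly three dot-separated numbers.
# Assumption: the pattern must match the whole tag, not just a part of it.
is_release_tag() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

for tag in 1.2.0 v1.2.0 1.2; do
    if is_release_tag "$tag"; then
        echo "$tag: would trigger a container release"
    else
        echo "$tag: no container release"
    fi
done
```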
The container includes a Conda installation and is pretty big. It should only be released if its content is actually changed. For instance, it would be perfectly fine to have a workflow version 1.6.5 but still refer to an old container for 1.2.7.
This is an outline of the procedure to release the container to GitHub Container Registry:
versionTag=1.2.0
docker \
build \
-t ghcr.io/dkfz-odcf/nf-bam2fastq:$versionTag \
--build-arg HTTP_PROXY=$HTTP_PROXY \
--build-arg HTTPS_PROXY=$HTTPS_PROXY \
./
Update the container version in nextflow.config to match $versionTag.
test/test1.sh docker-test docker
echo $CR_PAT | docker login ghcr.io -u vinjana --password-stdin
docker image push ghcr.io/dkfz-odcf/nf-bam2fastq:$versionTag
1.2.0
-env none for “lsf” cluster profile. Local environment should not be copied. This probably caused problems with the old “dkfzModules” environment profile.
nextflow.config.
*_BINARY variables in scripts. Binaries are fixed by Conda/containers.
conda.enabled = True with newer Nextflow
1.1.0 (February 2022)
--publishMode option to allow user to select the Nextflow publish mode. Default: rellink. Note that the former default was symlink, but as this change is considered negligible we classified the change as “minor”.
dkfzModules profile. Didn’t work well and was originally only for development. Please use ‘conda’, ‘singularity’ or ‘docker’. The container-based environments provide the best reproducibility.
1.0.1 (October 14, 2021)
1.0.0 (June 15, 2021)
bam2fastq.nf to main.nf (similar to nf-core projects)
--bamFiles to --input parameter
0.2.0 (November 17, 2020)
0.1.0 (August 26, 2020)
The workflow is a port of the Roddy-based BamToFastqPlugin. Compared to the Roddy-workflow, some problems with the execution of parallel processing, resulting in potential errors, have been fixed.
See LICENSE.txt and CONTRIBUTORS.