bioinformatics pipelines
This repository containers our pipelines, containers and scripts for them.
Pipelines are written in WDL language and Cromwell workflow execution engine is used to run them.
There are three major ways of running any of the pipelines in the repository:
For RNA-Seq pipeline there is also an option to start the pipeline inside the CKAN system.
In order to run the pipeline three important components must be present: main workflow, subworkflows (if exist) and input yaml or json file.
For example, for RNA-Seq workflow the user should choose pipelines/quantification/quantification.wdl as main workflow and quant_sample.wdl, quant_run.wdl, extract_run.wdl as subworkflows.
In the input json GSM ids should be specified together with pates to output folder and salmon indexes. With CromwellClient it is possible to put default json values for each of te pipelines to reduce the number of input parameters.
All the tools used by the workflows are used in the from of official (at dockerhub) or unofficial (https://quay.io/repository/comp-bio-aging/)[https://quay.io/repository/comp-bio-aging/], build specifically for the pipeline, containers with dockerfile sources at https://github.com/antonkulaga/biocontainers.
There is no need to install any tools separately, only running docker daemon is needed. Docker installation is documented at the official Docker website.
The pipeline is executed by Cromwell workflow execution engine.
There is also a microservice that allows starting the pipeline directly from CKAN. The main part of the pipeline is a quantification workflow.
The quantification part requires only GSM ids and list of indexed transcriptomes as input and consists of 4 sub workflows:
There are two ways of running the pipeline with quantification.wdl and GSM ids or with quant_by_runs.wdl and SRR idds.
{
"quant_index_batch.indexes_folder": "/data/indexes/salmon/1.1.0/ensembl_99",
"quant_index_batch.threads_per_index": 5,
"quant_index_batch.references": [
{
"species": "acanthochromis_polyacanthus",
"genome": "/data/ensembl/99/species/acanthochromis_polyacanthus/Acanthochromis_polyacanthus.ASM210954v1.dna.toplevel.fa",
"transcriptome": "/data/ensembl/99/species/acanthochromis_polyacanthus/Acanthochromis_polyacanthus.ASM210954v1.cdna.all.fa",
"version": "ASM210954v1",
"subversion": "ensembl_99"
}],
"quant_index_batch.memory_per_index": "24G"
}
To run with SRR as ids quant_by_runs.wdl is used instead of quantification, also samples.wdl is not needed, the rest remains the same as with quantification pipeline:
{
"quantification.samples_folder": "/data/samples",
"quantification.experiments": [
"GSM1580881",
"GSM1580889"
],
"quantification.title": "Human brain",
"quantification.salmon_indexes" : {
"Homo sapiens" : "/data/indexes/salmon/13.1/Homo_sapiens/gencode.v30",
"Mus musculus" : "/data/indexes/salmon/14.1/Mus_musculus/gencode.vM21"
},
"quantification.transcripts2genes" : {
"Homo sapiens" : "/data/ensembl/96/tx2gene/Homo_sapiens/hsapiens_96_tx2gene.tsv",
"Mus musculus" : "/data/ensembl/96/tx2gene/Mus_musculus/mmusculus_96_tx2gene.tsv"
}
}
{
"quant_by_runs.samples_folder": "/data/samples",
"quant_by_runs.runs": [
"SRR1822398", "SRR1822406", "SRR3109726", "SRR3109724", "SRR489494"
],
"quant_by_runs.salmon_indexes" : {
"Homo sapiens" : "/data/indexes/salmon/14.1/Homo_sapiens/gencode.v30",
"Mus musculus" : "/data/indexes/salmon/14.1/Mus_musculus/gencode.vM21"
},
"quant_by_runs.transcripts2genes" : {
"Homo sapiens" : "/data/ensembl/100/tx2gene/Homo_sapiens/hsapiens_100_tx2gene.tsv",
"Mus musculus" : "/data/ensembl/100/tx2gene/Mus_musculus/mmusculus_100_tx2gene.tsv"
}
}
To use Bs-Seq pipeline the user must first build the index, i.e:
{
"bitmapper_index.index_name": "homo_sapiens",
"bitmapper_index.genome": "/data/ensembl/100/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
"bitmapper_index.destination": "/data/indexes/bitmapper/ensembl_100/"
}
After building te index, the user can run bs_map_fast.wdl pipeline with extract_run.wdl as subpipeline (does the same job as extract.wdl in RNA-Seq)
Bs-Seq pipeline receives SRA run id with some parameters as output folder, layout, index path and threads as input, i.e:
{
"bs_map_fast.run": "SRR948855",
"bs_map_fast.output_folder": "/data/samples/bs-seq/SRR948855",
"bs_map_fast.layout": "PAIRED",
"bs_map_fast.genome_index": "/data/indexes/bitmapper/ensembl_97/homo_sapiens",
"bs_map_fast.genome": "/data/ensembl/97/species/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
"bs_map_fast.extract_threads": 4,
"bs_map_fast.copy_cleaned": false,
"bs_map_fast.map_threads": 8
}
It does quality control, reads deduplication and alignment as well as methylation extraction.
The repository also contains functional Chip-Seq ( pipelines/chip-seq ) and de-novo DNA assembly ( pipelines/de-novo/dna) as well as tools section with WDL scripts for multiple tools.
DNA-Seq pipeline is located in a separate repository at github.com/antonkulaga/dna-seq