GEne Cluster prediction with COnditional random fields.
GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and
scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs)
in genomic and metagenomic data using Conditional Random Fields (CRFs).
GECCO is implemented in Python, and supports all
versions from Python 3.7. It requires
additional libraries that can be installed directly from
PyPI, the Python Package Index.
Use pip to install GECCO on your machine:
$ pip install gecco-tool
If you’d rather use Conda, a package is available in the bioconda channel.
You can install it with:
$ conda install -c bioconda gecco
This will install GECCO, its dependencies, and the data needed to run
predictions. This requires around 40MB of data to be downloaded, so
it could take some time depending on your Internet connection. Once done,
you will have a gecco command available in your $PATH.
Note that GECCO uses HMMER3, which can only run
on PowerPC and recent x86-64 machines running a POSIX operating system.
Therefore, GECCO will work on Linux and OSX, but not on Windows.
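The requirements stated above (Python 3.7 or later, a POSIX system on x86-64 or PowerPC, and a gecco command on the $PATH) can be checked before running anything. The following is a minimal sketch, not part of GECCO itself: the helper names and the list of machine identifiers are assumptions chosen for illustration, and the checks only encode what this README states.

```python
import os
import platform
import shutil
import sys

# Assumed spellings of the architectures named in the README, as they are
# commonly reported by platform.machine(); adjust for your own systems.
SUPPORTED_MACHINES = ("x86_64", "amd64", "ppc", "ppc64", "ppc64le", "powerpc")


def python_ok(version_info=None):
    """Return True if the interpreter is Python 3.7 or later."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= (3, 7)


def platform_ok(os_name=None, machine=None):
    """Return True for a POSIX system on one of the architectures listed above."""
    os_name = os.name if os_name is None else os_name
    machine = platform.machine() if machine is None else machine
    return os_name == "posix" and machine.lower() in SUPPORTED_MACHINES


def gecco_on_path():
    """Return True if a `gecco` executable can be found on the $PATH."""
    return shutil.which("gecco") is not None


def preflight():
    """Run every check and report the results by name."""
    return {
        "python": python_ok(),
        "platform": platform_ok(),
        "gecco": gecco_on_path(),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'not satisfied'}")
```

A failed platform check here only means the machine does not match the description in this README; it does not prove that installation will fail.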
Once gecco is installed, you can run it from the terminal by giving it a
FASTA or GenBank file with the genomic sequence you want to analyze, as
well as an output directory:
$ gecco run --genome some_genome.fna -o some_output_dir
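When GECCO has to be run on many genomes, the command above can be assembled from a script. The sketch below wraps it with the standard subprocess module; the function names are hypothetical, and the only flags used are the ones this README mentions (--genome, -o, --jobs, --cds, --threshold and --cds-feature). The value types (integers for --jobs and --cds, a float for --threshold) are assumptions.

```python
import subprocess


def build_run_command(genome, output_dir, jobs=None, cds=None,
                      threshold=None, cds_feature=None):
    """Assemble the argument list for `gecco run`."""
    command = ["gecco", "run", "--genome", str(genome), "-o", str(output_dir)]
    optional = [
        ("--jobs", jobs),
        ("--cds", cds),
        ("--threshold", threshold),
        ("--cds-feature", cds_feature),
    ]
    # Only pass the options that were explicitly set, so that GECCO keeps
    # its own defaults for everything else.
    for flag, value in optional:
        if value is not None:
            command += [flag, str(value)]
    return command


def run_gecco(genome, output_dir, **options):
    """Run GECCO and raise CalledProcessError if it exits with an error."""
    command = build_run_command(genome, output_dir, **options)
    return subprocess.run(command, check=True)
```

For instance, `run_gecco("some_genome.fna", "some_output_dir", jobs=4)` would execute the same command as above with `--jobs 4` appended. Passing the arguments as a list avoids shell quoting problems with unusual file names.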
Additional parameters of interest are:
--jobs, which controls the number of threads that will be spawned by GECCO; the number of available CPUs can be detected with os.cpu_count.
--cds, controlling the minimum number of consecutive genes a BGC region must contain.
--threshold, controlling the minimum probability for a gene to be considered part of a BGC.
--cds-feature, which can be supplied a feature name to extract genes from existing annotations, for instance --cds-feature CDS.
GECCO will create the following files:
{genome}.genes.tsv: The genes file, containing the genes extracted from the input.
{genome}.features.tsv: The features file, containing the identified protein domains.
{genome}.clusters.tsv: If any were found, a clusters file, containing the predicted clusters.
{genome}_cluster_{N}.gbk: If any were found, a GenBank file per cluster, containing the annotated cluster sequence.
GECCO can also convert results to other formats that may be more convenient
depending on the downstream usage. GECCO can convert results into:
GFF (gecco convert clusters --format gff).
BiG-SLiCE input (gecco convert gbk --format bigslice).
Nucleotide FASTA (gecco convert gbk --format fna).
Protein FASTA (gecco convert gbk --format faa).
For a more visual way of exploring the predictions, you
can open the GenBank files in genome editing software such as UGENE.
You can otherwise load the results into an AntiSMASH report: check the
Integrations page of the
documentation for a step-by-step guide.
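The tabular outputs listed earlier can also be inspected from a script. The following sketch locates the files for one genome and loads a table with the standard csv module. It rests on several assumptions: the function names are hypothetical, {genome} is taken to be a name you supply (for example the input file name without its extension), and each table is assumed to start with a header row. No column names are assumed.

```python
import csv
from pathlib import Path


def output_paths(output_dir, genome):
    """Locate the files named after `genome` inside `output_dir`."""
    out = Path(output_dir)
    return {
        "genes": out / f"{genome}.genes.tsv",
        "features": out / f"{genome}.features.tsv",
        "clusters": out / f"{genome}.clusters.tsv",
        # One GenBank file per predicted cluster, if any were found.
        "genbank": sorted(out.glob(f"{genome}_cluster_*.gbk")),
    }


def read_table(path):
    """Read a tab-separated file into a list of dicts keyed by its header row.

    Returns an empty list when the file does not exist, which is expected
    for the clusters file when no cluster was found.
    """
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `len(read_table(output_paths("some_output_dir", "some_genome")["clusters"]))` gives the number of rows in the clusters file, or 0 if it was not written.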
GECCO can be cited using the following preprint:
Accurate de novo identification of biosynthetic gene clusters with GECCO.
Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller.
bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509
Found a bug? Have an enhancement request? Head over to the GitHub issue
tracker if you need to report
or ask something. If you are filing a bug report, please include as much
information as you can about the issue, and try to recreate the same bug
in a simple, easily reproducible situation.
Contributions are more than welcome! See CONTRIBUTING.md
for more details.
This software is provided under the GNU General Public License v3.0 or later. GECCO is developed by the Zeller Team
at the European Molecular Biology Laboratory in Heidelberg.