Python library for running data analysis pipelines for the IGF team
https://data-management-python.readthedocs.io
This repository contains the core Python library developed and maintained by the NIHR Imperial BRC Genomics Facility for managing raw and processed genomic datasets efficiently.
1. Metadata Management
2. Genomic Sequencing Runs Processing
3. Analysis Pipelines
• Python v3.10
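As a quick sanity check before installing, the interpreter version can be compared against the requirement above. This is a minimal sketch: the helper name `python_version_ok` is hypothetical and not part of this library, and it assumes the requirement means exactly the 3.10 series.

```python
import sys

# Required major.minor version, taken from the prerequisite listed above.
REQUIRED = (3, 10)


def python_version_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True when the interpreter's major.minor matches `required`.

    Hypothetical helper for illustration; assumes an exact 3.10.x match
    is wanted rather than "3.10 or newer".
    """
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_version_ok() else "does not match 3.10"
    print(f"Python {found}: {status}")
```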
1. Clone the Repository
git clone https://github.com/imperial-genomics-facility/data-management-python.git
2. Install Dependencies
Install required Python libraries:
pip install -r requirements_2.10.4.txt # For compatibility with Apache Airflow v2.10.4
3. Update PYTHONPATH
Add the core library path to PYTHONPATH:
export PYTHONPATH=/PATH/data-management-python
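After step 3, it can be useful to confirm that the cloned repository is actually on Python's import search path. The sketch below is illustrative only: `path_is_importable` is a hypothetical helper, not a function provided by this library, and it simply compares resolved directory paths.

```python
import os
import sys
from pathlib import Path


def path_is_importable(repo_path, search_path=None):
    """Return True if `repo_path` is one of the directories Python searches.

    `search_path` defaults to `sys.path`; an empty entry there stands for
    the current working directory.
    """
    entries = sys.path if search_path is None else search_path
    target = Path(repo_path).resolve()
    return any(Path(entry or ".").resolve() == target for entry in entries)


if __name__ == "__main__":
    # Report each PYTHONPATH entry and whether the interpreter picked it up.
    for entry in filter(None, os.environ.get("PYTHONPATH", "").split(os.pathsep)):
        state = "on sys.path" if path_is_importable(entry) else "NOT on sys.path"
        print(f"{entry}: {state}")
```

Running the script in the same shell where `PYTHONPATH` was exported should list the repository directory as being on `sys.path`.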
1. Set environment variables
export AIRFLOW_VERSION=VERSION
export PYTHON_VERSION=VERSION # Major.minor only (e.g. 3.10), as used in Airflow constraint file names
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
2. Install core Airflow libraries
pip install "apache-airflow[celery,postgres,redis,graphviz,pandas,apache-spark,airbyte,amazon,slack,singularity,ssh,sftp,smtp]==${AIRFLOW_VERSION}" --constraint ${CONSTRAINT_URL}
3. Install additional libraries
pip install asana gviz-api html5lib matplotlib PyMySQL pytest pytest-cov tox slackclient --constraint ${CONSTRAINT_URL}
4. Record the installed Python library versions in a requirements file
pip freeze > requirements_vVERSION.txt
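The constraint URL from step 1 combines the Airflow version with the major.minor Python version. The sketch below shows how the URL is assembled; `constraint_url` is a hypothetical helper written for illustration, and the version strings in the usage example are sample values (2.10.4 and 3.10 are the versions mentioned elsewhere in this README).

```python
def constraint_url(airflow_version, python_version):
    """Build the Airflow constraints file URL for a given version pair.

    Airflow constraint files are named by major.minor Python version,
    so a patch component (e.g. the ".12" in "3.10.12") is dropped.
    """
    major_minor = ".".join(python_version.split(".")[:2])
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{major_minor}.txt"
    )


if __name__ == "__main__":
    # Sample values only; substitute the versions you are targeting.
    print(constraint_url("2.10.4", "3.10.12"))
```

Passing a full patch version such as `3.10.12` directly into `PYTHON_VERSION` in the shell would produce a URL that does not exist, which is why the helper trims it to `3.10`.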
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.