Project author: imperial-genomics-facility

Project description:
Python library for running data analysis pipelines for the IGF team
Primary language: Python
Repository: git://github.com/imperial-genomics-facility/data-management-python.git
Created: 2017-03-24T11:28:45Z
Project community: https://github.com/imperial-genomics-facility/data-management-python

License: Apache License 2.0

Data Management Using Python Library

https://data-management-python.readthedocs.io

This repository contains the core Python library developed and maintained by the NIHR Imperial BRC Genomics Facility for managing raw and processed genomic datasets efficiently.

Key Features

1. Metadata Management

  • Utilizes an extended ENA metadata model for managing information about:
    • Projects
    • Samples
    • Sequencing runs
    • Analysis
    • File paths
    • Pipeline instances

2. Genomic Sequencing Runs Processing

  • Tracks ongoing sequencing runs and initiates processing upon completion.
  • Generates summary reports and sends email notifications to users.

3. Analysis Pipelines

  • Includes wrappers for both community-developed and vendor-provided data pipelines.
  • Automates:
    • Configuration generation
    • Input formatting
  • Executes external pipelines on HPC using bash script wrappers.
  • Manages post-processing, including:
    • Custom report generation
    • Analysis data validation
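The bash-wrapper step listed above can be illustrated with a minimal sketch. It is not the library's actual API: the template text, the `render_wrapper` helper, and the `--config` flag are all hypothetical, chosen only to show how a generated configuration file and a pipeline command might be combined into a script that an HPC scheduler can run.

```python
import shlex
from string import Template

# Hypothetical wrapper template; the real library's templates may differ.
WRAPPER_TEMPLATE = Template("""#!/bin/bash
set -euo pipefail
# Run an external pipeline with a generated config file
cd $work_dir
$command --config $config_file
""")


def render_wrapper(command, config_file, work_dir):
    """Return the text of a bash wrapper script.

    Every value is shell-quoted so that paths containing spaces
    survive when the script is executed.
    """
    return WRAPPER_TEMPLATE.substitute(
        command=shlex.quote(command),
        config_file=shlex.quote(config_file),
        work_dir=shlex.quote(work_dir),
    )
```

The rendered text could then be written to a file and submitted to the cluster scheduler; that submission step is site-specific and is left out here.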

Requirements

• Python v3.10
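
A quick way to confirm the interpreter before installing is sketched below. It assumes the requirement means an exact match on the major and minor version; the `python_matches` helper is illustrative and not part of this library.

```python
import sys

REQUIRED = (3, 10)  # major, minor from the requirement above


def python_matches(required=REQUIRED, current=None):
    """Return True when the major.minor of `current` equals `required`.

    `current` defaults to the running interpreter's version.
    """
    if current is None:
        current = sys.version_info[:2]
    return tuple(current[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:2])
    status = "OK" if python_matches() else "mismatch"
    print(f"Python {found}: {status}")
```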

Installation

1. Clone the Repository

  git clone https://github.com/imperial-genomics-facility/data-management-python.git

2. Install Dependencies
Install required Python libraries:

  pip install -r requirements_2.10.4.txt # For compatibility with Apache Airflow v2.10.4

3. Update PYTHONPATH
Add the core library path to PYTHONPATH:

  export PYTHONPATH=/PATH/data-management-python
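
To check that the PYTHONPATH change took effect, a small sketch like the following can be run in a new Python session. The `is_importable` helper is illustrative, and the package name used in the usage note is an assumption about the repository layout.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for some dotted names or modules with no usable spec
        return False
```

For example, `is_importable("igf_data")` should return True once the path is set, assuming `igf_data` is the name of a top-level package in the cloned repository.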

Update Airflow Version

1. Set Environment Variables

  export AIRFLOW_VERSION=VERSION
  export PYTHON_VERSION=VERSION
  export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
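
Airflow names its constraint files by Python major.minor version (for example 3.10), so PYTHON_VERSION should be given in that form. The sketch below builds the same URL in Python and, when no Python version is given, derives it from the running interpreter. The `constraint_url` helper is illustrative only.

```python
import sys

CONSTRAINT_TEMPLATE = (
    "https://raw.githubusercontent.com/apache/airflow/"
    "constraints-{airflow}/constraints-{python}.txt"
)


def constraint_url(airflow_version, python_version=None):
    """Build the Airflow constraints URL for the given versions.

    `python_version` is a major.minor string such as "3.10"; it defaults
    to the version of the interpreter running this code.
    """
    if python_version is None:
        python_version = "{}.{}".format(*sys.version_info[:2])
    return CONSTRAINT_TEMPLATE.format(
        airflow=airflow_version, python=python_version
    )
```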

2. Install Core Airflow Libraries

  pip install "apache-airflow[celery,postgres,redis,graphviz,pandas,apache-spark,airbyte,amazon,slack,singularity,ssh,sftp,smtp]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

3. Install Additional Libraries

  pip install asana gviz-api html5lib matplotlib PyMySQL pytest pytest-cov tox slackclient --constraint "${CONSTRAINT_URL}"

4. Record Python Library Versions in a Requirements File

  pip freeze > requirements_vVERSION.txt
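
To inspect the pinned versions from within Python, a rough equivalent can be sketched with the standard library's `importlib.metadata`. This is an approximation and not a replacement for `pip freeze`: it does not handle editable or VCS installs the way pip does. The `pinned_requirements` helper is illustrative.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```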

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.