Author: d3b-center

Description:
🏭 Transform and export data from a relational database to a PFB (Portable Format for Bioinformatics) Avro file
Language: Python
Repository: git://github.com/d3b-center/d3b-lib-pfb-exporter.git
Created: 2020-02-27T15:28:26Z
Project page: https://github.com/d3b-center/d3b-lib-pfb-exporter

License: Apache License 2.0

🏭 PFB Exporter

Transform and export data from a relational database into a
PFB (Portable Format for Bioinformatics) file.

A PFB file is a special kind of Avro file, suitable for capturing and
reconstructing relational data. See the Background section for more information.

NOTE: This is still a 🚧 prototype, as it has only been tested on the Kids First
PostgreSQL database.

Quickstart

  $ git clone git@github.com:d3b-center/d3b-lib-pfb-exporter.git
  $ cd d3b-lib-pfb-exporter
  $ python3 -m venv venv
  $ source venv/bin/activate
  $ pip install -e .
  $ pfbe -h

Try it out:

  # List commands and docs
  $ pfbe --help
  # Create a PFB file from the given data and SQLAlchemy models
  $ pfbe export tests/data/input -m tests/data/models.py -o tests/data/pfb_export
  # Create a PFB file from the given data and generate SQLAlchemy models from db
  $ pfbe export tests/data/input -d $MY_DB_CONN_URL -m tests/data/models.py -o tests/data/pfb_export
  # Create just the PFB schema from the given SQLAlchemy models
  $ pfbe create_schema -m tests/data/models.py -o tests/data/pfb_export
  # Create just the PFB schema but first generate the SQLAlchemy models from db
  $ pfbe create_schema -d $MY_DB_CONN_URL -m tests/data/models.py -o tests/data/pfb_export

Outputs

The output contains the generated PFB file, logs, and other files for debugging.

  tests/data/pfb_export
  ├── logs
  │   └── pfb_export.log -> Log file containing log statements from console
  ├── metadata.json -> PFB Metadata Entity
  ├── models.py -> Generated SQLAlchemy model classes if run with -d CLI option
  ├── orm_models.json -> Serialized SQLAlchemy model classes
  ├── pfb.avro -> The PFB file
  └── pfb_schema.json -> The PFB file schema

Supported Databases

In theory, any database supported by SQLAlchemy should work, but this
has only been tested on a single PostgreSQL database.

Developers

Follow the Quickstart instructions first. Then install the dev requirements:

  $ pip install -r dev-requirements.txt

Background

What is an Avro File?

An Avro file contains data records (JSON) and a schema (JSON) that describes
each data record. Avro files can be serialized into a binary format and compressed.

Read more about Avro.
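
To make the schema-plus-records idea concrete, here is a minimal sketch using only the Python standard library. The `participant` schema, its field names, and the `conforms` helper are hypothetical illustrations, not part of PFB Exporter or of any Avro library, and the check covers only a few primitive types and unions.

```python
import json

# A hypothetical Avro record schema for a "participant" row (illustration only).
PARTICIPANT_SCHEMA = {
    "type": "record",
    "name": "participant",
    "fields": [
        {"name": "kf_id", "type": "string"},
        {"name": "is_proband", "type": ["null", "boolean"], "default": None},
    ],
}

# Map the few Avro primitive type names used here to Python types.
AVRO_TO_PYTHON = {"string": str, "boolean": bool, "null": type(None)}


def conforms(record, schema):
    """Return True if `record` has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    for field in fields:
        avro_type = field["type"]
        # A list denotes an Avro union: the value may match any member type.
        allowed = avro_type if isinstance(avro_type, list) else [avro_type]
        value = record[field["name"]]
        if not any(isinstance(value, AVRO_TO_PYTHON[t]) for t in allowed):
            return False
    return True


record = {"kf_id": "PT_00000001", "is_proband": None}
print(json.dumps(PARTICIPANT_SCHEMA, indent=2))
print(conforms(record, PARTICIPANT_SCHEMA))
```

A real Avro library would also handle the binary encoding and compression mentioned above; this sketch only shows how a JSON schema describes a record.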

What is a PFB File?

A PFB file is a special kind of Avro file, suitable for capturing and
reconstructing biomedical relational data.

A PFB file is an Avro file with a particular Avro schema that represents a
relational database. We call this schema the PFB Schema.

The data in a PFB file is a list of JSON objects called PFB Entity
objects. There are two types of PFB Entities. One (Metadata) captures
information about the relational database, and the other (Table Row) captures
a row of data from a particular table in the database.

The data records in a PFB file are produced by transforming the original data
from a relational database into PFB Entity objects. Each PFB Entity object
conforms to its Avro schema.
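
As a rough illustration of the two entity types, the sketch below builds one Metadata entity and one Table Row entity as plain dictionaries. The key names (`id`, `name`, `object`, `nodes`, `links`, `dst`) and both helper functions are assumptions made for this example; the authoritative layout is whatever the PFB Schema defines.

```python
# Hypothetical shapes for the two PFB Entity types; key names are illustrative.


def make_metadata_entity(tables):
    """Describe the database: one node per table, listing the tables it links to."""
    nodes = [
        {"name": name, "links": [{"dst": dst} for dst in sorted(tables[name])]}
        for name in sorted(tables)
    ]
    return {"name": "Metadata", "object": {"nodes": nodes}}


def make_row_entity(table, row, primary_key):
    """Wrap one table row, recording which table it came from."""
    return {"id": str(row[primary_key]), "name": table, "object": dict(row)}


# Table name -> set of tables it references (hypothetical example data).
metadata = make_metadata_entity({"participant": set(), "biospecimen": {"participant"}})
row = make_row_entity(
    "biospecimen", {"kf_id": "BS_1", "participant_id": "PT_1"}, "kf_id"
)
print(metadata)
print(row)
```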

Vanilla Avro vs PFB

Let’s say a client receives an Avro file and reads in the Avro data.
The client now has the Avro schema and all of the data that conforms to that
schema in one big JSON blob, and it can do what it wants with it. Maybe it wants
to construct some data input forms. It has everything it needs to do this, since
the schema defines all of the entities, their attributes, and the attribute types.

Now what happens if the client wants to reconstruct a relational database
from the data? How does it know what tables to create, and what the
relationships are between those tables? Which relationships are
required vs not? This is one of the problems PFB addresses.
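
The sketch below shows why relationship metadata matters when rebuilding a database: once a client knows which tables reference which, it can work out a safe creation order and tell required relationships from optional ones. The `links` structure, its `dst` and `required` keys, and the table names are hypothetical; they stand in for whatever the PFB Metadata entity actually records.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical relationship metadata: table -> foreign-key links to other tables.
links = {
    "family": [],
    "participant": [{"dst": "family", "required": False}],
    "biospecimen": [{"dst": "participant", "required": True}],
}


def table_creation_order(links):
    """Order tables so every referenced table is created before its dependents."""
    graph = {
        table: {link["dst"] for link in outgoing}
        for table, outgoing in links.items()
    }
    return list(TopologicalSorter(graph).static_order())


def required_parents(links, table):
    """List the tables that `table` must reference (non-nullable relationships)."""
    return [link["dst"] for link in links[table] if link["required"]]


print(table_creation_order(links))
print(required_parents(links, "biospecimen"))
```

With plain Avro, none of the information in `links` is available to the client, so neither function could be written.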

How PFB Exporter CLI Works

PFB File Creation

  1. Create the Avro schemas for PFB Entity types and the PFB File
  2. Transform the JSON objects representing rows of data from the relational
    database into PFB Entities
  3. Add the Avro schemas to the PFB Avro file
  4. Add the PFB Entities to the Avro file
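
The four steps above can be sketched roughly as follows. This is not the exporter's implementation: JSON Lines written to an in-memory buffer stands in for the Avro container, and the function name, the entity keys, and the choice to write the Metadata entity before the row entities are all assumptions of this example.

```python
import io
import json


def build_pfb_like_file(schema, metadata, rows_by_table, out):
    """Write the schema, then a Metadata entity, then one entity per table row.

    Returns the number of row entities written.
    """
    # Steps 1 and 3: the schema is created up front and written at the head.
    out.write(json.dumps({"schema": schema}) + "\n")
    # Steps 2 and 4: transform rows into entities and append them to the file.
    out.write(json.dumps({"name": "Metadata", "object": metadata}) + "\n")
    count = 0
    for table, rows in rows_by_table.items():
        for row in rows:
            out.write(json.dumps({"name": table, "object": row}) + "\n")
            count += 1
    return count


buffer = io.StringIO()
written = build_pfb_like_file(
    schema={"type": "record", "name": "Entity", "fields": []},  # placeholder schema
    metadata={"nodes": [{"name": "participant"}]},
    rows_by_table={"participant": [{"kf_id": "PT_1"}, {"kf_id": "PT_2"}]},
    out=buffer,
)
lines = buffer.getvalue().splitlines()
print(written, len(lines))
```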

PFB Schema Creation

The PFB File schema is created from SQLAlchemy declarative base classes
in a file or directory. If the classes are not provided, they are generated
by inspecting the database’s schema using the sqlacodegen library.
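
To illustrate the general idea of deriving a record schema by inspecting a database, here is a self-contained sketch. It deliberately swaps in the standard-library `sqlite3` module and SQLite's `PRAGMA table_info` for SQLAlchemy and sqlacodegen, so it does not show how PFB Exporter itself does this. The type mapping, the `table_to_avro_schema` helper, and the example table are hypothetical.

```python
import sqlite3

# Hypothetical mapping from declared SQL column types to Avro primitive types.
SQL_TO_AVRO = {
    "TEXT": "string",
    "INTEGER": "long",
    "REAL": "double",
    "BOOLEAN": "boolean",
}


def table_to_avro_schema(conn, table):
    """Build an Avro-style record schema for `table` by inspecting the database.

    `table` is interpolated into the PRAGMA statement, so it must be a trusted name.
    """
    fields = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    for _cid, name, sql_type, notnull, _default, pk in conn.execute(
        f"PRAGMA table_info({table})"
    ):
        avro_type = SQL_TO_AVRO.get(sql_type.upper(), "string")
        # Columns that may hold NULL become a union with "null".
        if not (notnull or pk):
            avro_type = ["null", avro_type]
        fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table, "fields": fields}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE participant ("
    "kf_id TEXT PRIMARY KEY, age INTEGER, is_proband BOOLEAN NOT NULL)"
)
schema = table_to_avro_schema(conn, "participant")
print(schema)
```

The sketch treats primary-key columns as non-nullable, which is a simplification chosen for this example.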