A microframework for ETL solutions, intended to let you spend your code on the ETL logic itself while the framework handles the boilerplate.
Support for this Python library is very limited, and its requirements and supported Python versions are out of date. It served as a great sandbox for experimenting with a high-level, Pythonic API for quickly processing large amounts of information.
If you're looking into HPC or processing large amounts of data, I recommend looking into better-supported software.
This repository takes some of its best ideas from solutions like Dask, Python's multiprocessing module, and Zappa.
A microframework for simple ETL solutions.
At its core, `bert-etl` uses DynamoDB Streams to communicate between Lambda functions. `bert-etl.yaml` controls how the initial Lambda function is invoked: by periodic events, SNS topics, or S3 bucket events (planned). Passing an event to `bert-etl` is straightforward from `zappa` or from a generic AWS Lambda function you've hooked up to API Gateway. At this moment in time, there are no plans to attach API Gateway to `bert-etl.yaml`, because there is already great software (like `zappa`) that does this.
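For illustration only, a trigger entry in `bert-etl.yaml` might look like the sketch below. The actual schema isn't documented here, so every key name in this snippet is a hypothetical stand-in rather than the library's confirmed format.

```yaml
# Hypothetical sketch of a bert-etl.yaml trigger section; key names
# are illustrative assumptions, not the confirmed schema.
sync_sounds:
  events:
    # periodic invocation via a CloudWatch-style schedule expression
    schedule: rate(1 hour)
    # invoke when a message arrives on an SNS topic
    sns_topic: arn:aws:sns:us-east-1:123456789012:demo-topic
    # S3 bucket events are planned, not yet supported
```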
`bert-etl` ships with a deploy target for `aws-lambda`. This feature isn't very well documented yet and needs quite a bit of work before it functions consistently. Be aware that `aws-lambda` is a product run and controlled by AWS; if you incur charges using `bert-etl` while utilizing `aws-lambda`, we are not responsible. `bert-etl` is offered under the MIT license, which includes a "use at your own risk" clause.
Let's begin with an example that loads data from a file server and then loads it into numpy arrays.
$ virtualenv -p $(which python3) env
$ source env/bin/activate
$ pip install bert-etl
$ pip install librosa # for demo project
$ docker run -p 6379:6379 -d redis # bert-etl runs on redis to share data across CPUs
$ bert-runner.py -n demo # create a new project named "demo"
$ PYTHONPATH='.' bert-runner.py -m demo -j sync_sounds -f # run the sync_sounds job from the demo module
Bert provides a boilerplate framework that lets you write concurrent ETL code using Python's multiprocessing module. One function starts the process, piping data into a Redis backend that is then consumed by the next function. The queues are respectively named for the scope of the function: a Work (start) queue and a Done (end) queue. Please consider contributing to Bert Bounty Targets to improve this documentation:
https://www.patreon.com/jbcurtin
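As a rough illustration of that pattern, a job module might look like the sketch below. It is a minimal sketch assuming the API surface suggested by the identifiers listed under the bounty targets (`bert.binding`, `comm_binder`, `work_queue`, `done_queue`, `ologger`); the decorator name, call signatures, and payload shapes are assumptions, not confirmed `bert-etl` API.

```python
# demo/jobs.py -- hypothetical sketch; decorator and helper names are
# assumptions drawn from the identifiers in this README, not confirmed API.
from bert import binding

@binding.follow('noop')  # assumed: 'noop' marks the start of the pipeline
def sync_sounds():
    work_queue, done_queue, ologger = binding.comm_binder(sync_sounds)
    # First function: pipe work into the Redis backend for the next step.
    for filename in ('one.wav', 'two.wav'):
        done_queue.put({'filename': filename})
    ologger.info('queued sound files')

@binding.follow(sync_sounds)  # assumed: chains this job after sync_sounds
def load_arrays():
    work_queue, done_queue, ologger = binding.comm_binder(load_arrays)
    # Next function: its Work queue is the previous function's Done queue.
    for details in work_queue:
        ologger.info(f"loading {details['filename']}")
```

With a layout like this, the runner command above executes `sync_sounds` first, and `load_arrays` consumes its output from Redis.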
Documentation is still needed for:

- `bert-etl.yaml`
- `bert.binding`
- `comm_binder`
- `work_queue`
- `done_queue`
- `ologger`
- `DEBUG`, and how turning it off allows for x-concurrent processes