ETL for moving Ethereum data to PostgreSQL database
The steps below will allow you to bootstrap a PostgreSQL database in GCP with full historical and real-time Ethereum data:
blocks, transactions, logs, token_transfers, and traces.
The whole process will take between 24 and 72 hours.
Prerequisites:
gcloud auth login
pip install -r requirements.txt
export BUCKET=<your_gcs_bucket>
bash ethereum_bigquery_to_gcs.sh $BUCKET
Optionally provide start and end dates: bash ethereum_bigquery_to_gcs.sh $BUCKET 2020-01-01 2020-01-31
Exporting to CSV files is going to take about 10 minutes.
export CLOUD_SQL_INSTANCE_ID=ethereum-0
export ROOT_PASSWORD=<your_password>
gcloud sql instances create $CLOUD_SQL_INSTANCE_ID --database-version=POSTGRES_11 --root-password=$ROOT_PASSWORD \
--storage-type=SSD --storage-size=100 --cpu=4 --memory=6 \
--database-flags=temp_file_limit=2147483647
Notice the storage size is set to 100 GB. It will scale up automatically to around 1.5 TB when we load in the data.
Add Cloud SQL service account to GCS bucket as objectViewer
.
Run gcloud sql instances describe $CLOUD_SQL_INSTANCE_ID
,
then copy serviceAccountEmailAddress
from the output and add it to the bucket.
Create the database and the tables:
gcloud sql databases create ethereum --instance=$CLOUD_SQL_INSTANCE_ID
# Install Cloud SQL Proxy following the instructions here https://cloud.google.com/sql/docs/mysql/sql-proxy#install
./cloud_sql_proxy -instances=myProject:us-central1:${CLOUD_SQL_INSTANCE_ID}=tcp:5433
cat schema/*.sql | psql -U postgres -d ethereum -h 127.0.0.1 --port 5433 -a
bash ethereum_gcs_to_cloud_sql.sh $BUCKET $CLOUD_SQL_INSTANCE_ID
Importing to Cloud SQL is going to take between 12 and 24 hours.
A few performance optimization tips for initial loading of the data:
NOTE: indexes won’t work for the contracts table due to the issue described here https://github.com/blockchain-etl/ethereum-etl-postgres/pull/11#issuecomment-1107801061
cat indexes/*.sql | psql -U postgres -d ethereum -h 127.0.0.1 --port 5433 -a
Creating indexes is going to take between 12 and 24 hours. Depending on the queries you’re going to run
you may need to create more indexes or partition the tables.
Cloud SQL instance will cost you between $200 and $500 per month depending on
whether you use HDD or SSD and on the machine type.
Use ethereumetl stream
command to continually pull data from an Ethereum node and insert it to Postgres tables:
https://github.com/blockchain-etl/ethereum-etl/tree/develop/docs/commands.md#stream.
Follow the instructions here to deploy it to Kubernetes: https://github.com/blockchain-etl/blockchain-etl-streaming.