Infra for ingest-hbase replica
The Intraday cluster is a read-replica of ingest-hbase.
Records are read from the replica cluster, decrypted in memory using DKS, and stored in s3.
Encryption in s3 is handled by the EMR Security Configuration.
Limitations:
meta
table. This results in incompatibility
Other documentation:
The Intraday Schedule is designed to process data hourly during the working day, excluding the HBase maintenance windows.
HBase maintenance is documented in the internal-compute repo, and is started and stopped by concourse jobs.
Intraday scheduling is achieved using cloudwatch cron rules to trigger the Intraday lambda.
The dynamodb table intraday-job-status records details for each collection processed, including:
If a scheduled cluster is already running when the job is triggered, the lambda waits for approx. 15 minutes
before timing out. If the running cluster later completes successfully, the timed-out trigger does not prevent
subsequent launches.
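The wait-then-time-out behaviour described above can be sketched as a polling loop. This is a minimal illustration, not the lambda's actual code: the function name, the poll interval and the injected cluster_is_running callable are all assumptions, and only the 15 minute limit comes from the text above.

```python
import time
from typing import Callable


def wait_for_running_cluster(
    cluster_is_running: Callable[[], bool],  # hypothetical check, e.g. wrapping an EMR lookup
    timeout_seconds: float = 15 * 60,        # approx. 15 minutes, as described above
    poll_seconds: float = 30,                # assumed poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once no cluster is running, or False if the wait times out."""
    deadline = clock() + timeout_seconds
    while cluster_is_running():
        if clock() >= deadline:
            return False  # give up; a later scheduled trigger can still launch
        sleep(poll_seconds)
    return True
```

The sleep and clock arguments are injectable only so the loop can be exercised without real waiting.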
If a scheduled cluster fails during launch or processing, subsequent clusters will not launch until the dynamodb
JobStatus is updated (i.e. from FAILED -> _FAILED). This is to provide time for troubleshooting and resolution
of the error.
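The launch gate described above can be sketched as a check over the job status records. This is an illustration under assumptions: the records are modelled as plain dictionaries, the attribute name JobStatus is inferred from the text, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def launch_allowed(
    job_records: Iterable[Mapping[str, str]],
    status_key: str = "JobStatus",      # assumed attribute name
    blocking_status: str = "FAILED",
) -> bool:
    """Return False while any record still carries the blocking status."""
    return not any(r.get(status_key) == blocking_status for r in job_records)


def acknowledge_failures(
    job_records: Iterable[Mapping[str, str]],
    status_key: str = "JobStatus",
    blocking_status: str = "FAILED",
) -> List[Dict[str, str]]:
    """Return copies of the records with FAILED rewritten to _FAILED."""
    return [
        {**r, status_key: "_" + blocking_status}
        if r.get(status_key) == blocking_status
        else dict(r)
        for r in job_records
    ]
```

Renaming the status rather than deleting the record keeps the failure visible for later review while letting scheduled launches resume.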
HBase read-replica clusters are not able to write/modify the data stored in HBase, but they do create folders in the
hbase root directory to manage a copy of the metadata. A directory is created for each cluster launched, and left
behind after termination.
These directories cause “inconsistencies” in the main cluster, which identifies the files as data for which it has no
metadata. To avoid inconsistencies being reported in the ingest-hbase cluster, the replica metadata is purged by lambda
at the termination of each replica cluster.
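A purge of this kind can be sketched as a prefix delete, assuming the HBase root directory is stored in s3. The bucket and prefix are hypothetical parameters (the real directory layout is not described here), and s3_client is expected to behave like a boto3 S3 client. Each list_objects_v2 page holds at most 1000 keys, which matches the per-request limit of delete_objects.

```python
def purge_prefix(s3_client, bucket: str, prefix: str) -> int:
    """Delete every object under `prefix` and return the number removed.

    `s3_client` is assumed to expose the boto3 S3 client methods
    get_paginator("list_objects_v2") and delete_objects.
    """
    if not prefix:
        # An empty prefix would match the whole bucket, including live HBase data.
        raise ValueError("refusing to purge with an empty prefix")

    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": objects})
            deleted += len(objects)
    return deleted
```

The prefix should end at the replica's own metadata directory so that nothing belonging to the main cluster is matched.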
Logs are collected in cloudwatch under /app/ingest-replica-incremental/
There is a concourse pipeline for intraday named dataworks-aws-ingest-replica, defined in the ci folder.
Admin jobs can be found in the utility group in concourse, intraday-emr-admin.
This will start a cluster with no steps. The cluster must be terminated manually once it is no longer required.
This can be used to stop a cluster with the given ID. Amend the cluster_id parameter in the code and aviator the
change before running the job.
This can be used to remove HBase read-replica metadata if the lambda has not already done so. Provide the CLUSTER_ID
parameter, and aviator the change before running the job.
This will trigger a cluster as if triggered by the cloudwatch rule. The lambda will check for other running clusters
before launching, check the timestamps of the latest processed records, and process everything from that point onwards
for the collections defined in the aws-secrets for Intraday.
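The "process everything from that point onwards" step can be sketched as a per-collection window calculation. This is an illustration only: the function name, the millisecond timestamps and the default start for collections with no previous run are assumptions, not details taken from the lambda.

```python
from typing import Dict, Iterable, Mapping, Tuple


def processing_windows(
    collections: Iterable[str],
    last_processed_ms: Mapping[str, int],  # latest processed timestamp per collection
    now_ms: int,
    default_start_ms: int = 0,             # assumed start for never-processed collections
) -> Dict[str, Tuple[int, int]]:
    """Map each collection to the (start, end) timestamp range still to process."""
    windows: Dict[str, Tuple[int, int]] = {}
    for name in collections:
        start = last_processed_ms.get(name, default_start_ms)
        if start < now_ms:  # skip collections that are already up to date
            windows[name] = (start, now_ms)
    return windows
```

For example, a collection last processed at timestamp 100 and triggered at timestamp 200 is given the window (100, 200).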