Embulk plugin that loads records from Google Cloud Storage
embulk-input-gcs v0.5.0+ requires Embulk v0.11.0+.
java -jar embulk-X.Y.Z.jar install "org.embulk:embulk-input-gcs:0.5.0"
If you chose “private_key” or “json_key” as auth_method, you can obtain the service_account_email and the private key or JSON key as follows.
Create a project in the Google Developers Console.
Create a “Service Account” by following this step.
A Service Account has two specific scopes: read-only and read-write.
embulk-input-gcs can run with the “read-only” scope.
Generate a private key in P12 (PKCS12) format or a JSON key, and upload it to the machine.
java -jar embulk-X.Y.Z.jar run /path/to/config.yml
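If incremental loading is enabled, the config diff for the next execution will include the last_path
parameter so that the next execution skips files before the path. Otherwise, last_path
will not be included.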
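The effect of last_path can be pictured with a short Python sketch. It is not the plugin's code: it assumes that object paths are compared in lexicographic order and that everything at or before last_path counts as already loaded. The function name and sample paths are hypothetical.

```python
def remaining_paths(all_paths, last_path=None):
    """Return the paths a next run would still load (illustrative only)."""
    ordered = sorted(all_paths)
    if last_path is None:
        # No last_path recorded: nothing is skipped.
        return ordered
    # Assumption: anything ordered at or before last_path was already loaded.
    return [p for p in ordered if p > last_path]


paths = ["logs/csv-001.csv", "logs/csv-002.csv", "logs/csv-003.csv"]
print(remaining_paths(paths, last_path="logs/csv-002.csv"))
# prints ['logs/csv-003.csv']
```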
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: logs/csv-
  auth_method: private_key #default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  application_name: Anything you like
Example for “sample_01.csv.gz”, generated by embulk example
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: sample_
  auth_method: private_key #default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  application_name: Anything you like
  decoders:
    - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
      - {name: id, type: long}
      - {name: account, type: long}
      - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
      - {name: purchase, type: timestamp, format: '%Y%m%d'}
      - {name: comment, type: string}
out: {type: stdout}
To skip files using regexp:
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  bucket: my-gcs-bucket
  path_prefix: logs/csv-
  # ...
  path_match_pattern: \.csv$ # a file will be skipped if its path doesn't match this pattern
  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$ # match files whose suffix is .csv or .csv.gz
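To try a pattern before putting it in config.yml, a rough Python sketch like the following may help. It only approximates the behaviour described above: it assumes the pattern may match anywhere in the path (which the /archive/ example suggests), and it uses Python's re module rather than the Java regular expressions an Embulk plugin would use, so only simple patterns are guaranteed to behave the same. The function name and object paths are hypothetical.

```python
import re


def select_paths(paths, path_prefix, path_match_pattern=None):
    """Keep paths that start with the prefix and match the pattern."""
    selected = [p for p in paths if p.startswith(path_prefix)]
    if path_match_pattern is not None:
        pattern = re.compile(path_match_pattern)
        # Assumption: a match anywhere in the path is enough (search, not full match).
        selected = [p for p in selected if pattern.search(p)]
    return selected


objects = [
    "logs/csv-2024/a.csv",
    "logs/csv-2024/a.csv.bak",
    "logs/csv-2024/archive/b.csv",
    "logs/other/c.csv",
]
print(select_paths(objects, "logs/csv-", r"\.csv$"))
# prints ['logs/csv-2024/a.csv', 'logs/csv-2024/archive/b.csv']
print(select_paths(objects, "logs/csv-", r"/archive/"))
# prints ['logs/csv-2024/archive/b.csv']
```

Note that an unescaped dot, as in .csv$, matches any character, so a path ending in "xcsv" would also pass; escaping it as \.csv$ avoids that.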
Three methods are supported for fetching an access token for the service account.
You first need to create a service account (client ID), download its private key and deploy the key with embulk.
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: private_key
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
You first need to create a service account (client ID), download its json key and deploy the key with embulk.
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: json_key
  json_keyfile: /path/to/json_keyfile.json
You can also embed the contents of json_keyfile in config.yml.
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: json_key
  json_keyfile:
    content: |
      {
        "private_key_id": "123456789",
        "private_key": "-----BEGIN PRIVATE KEY-----\nABCDEF",
        "client_email": "..."
      }
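When embedding a key by hand, it is easy to lose a field or break the JSON. The following Python sketch checks a JSON key string for the three fields shown in the example above. It is not the plugin's own validation, a real key file contains more fields than these, and the function name and sample string are hypothetical.

```python
import json

# Fields taken from the embedded-key example; real keys contain more.
EXPECTED_FIELDS = ("private_key_id", "private_key", "client_email")


def missing_key_fields(json_text):
    """Return the expected fields that are absent from a JSON key string."""
    key = json.loads(json_text)  # raises ValueError if the text is not valid JSON
    return [name for name in EXPECTED_FIELDS if name not in key]


sample = '{"private_key_id": "123456789", "private_key": "-----BEGIN PRIVATE KEY-----\\nABCDEF"}'
print(missing_key_fields(sample))
# prints ['client_email']
```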
On the other hand, you don’t need to explicitly create a service account for embulk when you
run embulk in Google Compute Engine. In this third authentication method, you need to
add the API scope “https://www.googleapis.com/auth/devstorage.read_only” to the scope list of your
Compute Engine VM instance; then you can configure embulk like this.
See “Setting the scope of service account access for instances”.
in:
  type:
    source: maven
    group: org.embulk
    name: gcs
    version: "0.5.0"
  auth_method: compute_engine
Listing objects is eventually consistent, although getting objects is strongly consistent; see https://cloud.google.com/storage/docs/consistency.
The path_prefix option uses the objects list API, so it may miss some objects.
To avoid this, use the paths option, which specifies object paths directly
without using the objects list API.
./gradlew jar
To run the unit tests, configure the following environment variables.
In addition, the following files must be uploaded to an existing GCS bucket.
When the environment variables are not set, some test cases are skipped.
GCP_EMAIL
GCP_P12_KEYFILE
GCP_JSON_KEYFILE
GCP_BUCKET
GCP_BUCKET_DIRECTORY (optional, if needed)
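Before running the tests, a small Python check like the one below can show which variables are still unset. The variable names are the ones used in the launchctl example in this section; treating all four as required is an assumption, and the function name is hypothetical.

```python
import os

# Assumed required; GCP_BUCKET_DIRECTORY is optional and not checked here.
REQUIRED = ("GCP_EMAIL", "GCP_P12_KEYFILE", "GCP_JSON_KEYFILE", "GCP_BUCKET")


def missing_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED if not env.get(name)]


print(missing_variables({"GCP_EMAIL": "x", "GCP_BUCKET": "my-bucket"}))
# prints ['GCP_P12_KEYFILE', 'GCP_JSON_KEYFILE']
```

Calling missing_variables() with no argument inspects the current process environment.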
If you’re using Mac OS X El Capitan with GUI applications (such as an IDE), set the variables as follows.
$ vi ~/Library/LaunchAgents/environment.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>my.startup</string>
  <key>ProgramArguments</key>
  <array><string>sh</string><string>-c</string><string>
    launchctl setenv GCP_EMAIL ABCXYZ123ABCXYZ123.gserviceaccount.com
    launchctl setenv GCP_P12_KEYFILE /path/to/p12_keyfile.p12
    launchctl setenv GCP_JSON_KEYFILE /path/to/json_keyfile.json
    launchctl setenv GCP_BUCKET my-bucket
    launchctl setenv GCP_BUCKET_DIRECTORY unittests
  </string></array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
$ launchctl load ~/Library/LaunchAgents/environment.plist
$ launchctl getenv GCP_EMAIL # try to get the value
Then start your applications.
Modify version in build.gradle at a detached commit, and then tag the commit with an annotation.
git checkout --detach master
(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)
git add build.gradle
git commit -m "Release vX.Y.Z"
git tag -a vX.Y.Z
(Edit: Write a tag annotation in the changelog format.)
See Keep a Changelog for the changelog format. We adopt a part of it for Git’s tag annotation like below.
## [X.Y.Z] - YYYY-MM-DD
### Added
- Added a feature.
### Changed
- Changed something.
### Fixed
- Fixed a bug.
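If you prefer to generate the annotation text rather than type it, the format above can be sketched in Python as follows. This is only an illustration: the function name, the version number 0.5.1, and the date are hypothetical, and empty sections are simply left out.

```python
from datetime import date


def tag_annotation(version, released, added=(), changed=(), fixed=()):
    """Render a tag annotation in the changelog format shown above."""
    lines = [f"## [{version}] - {released.isoformat()}"]
    for title, items in (("Added", added), ("Changed", changed), ("Fixed", fixed)):
        if items:  # skip sections that have no entries
            lines.append(f"### {title}")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


print(tag_annotation("0.5.1", date(2024, 1, 15), fixed=["Fixed a bug."]))
# prints:
# ## [0.5.1] - 2024-01-15
# ### Fixed
# - Fixed a bug.
```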
Then push the annotated tag. It triggers a release operation on GitHub Actions after approval.
git push -u origin vX.Y.Z