Project author: embulk

Project description:
Embulk plugin that loads records from Google Cloud Storage
Primary language: Java
Project URL: git://github.com/embulk/embulk-input-gcs.git
Created: 2015-03-01T05:06:39Z
Project community: https://github.com/embulk/embulk-input-gcs


Google Cloud Storage file input plugin for Embulk

Overview

embulk-input-gcs v0.5.0+ requires Embulk v0.11.0+.

  • Plugin type: file input
  • Resume supported: yes
  • Cleanup supported: yes

Usage

Install plugin

    java -jar embulk-X.Y.Z.jar install "org.embulk:embulk-input-gcs:0.5.0"

Google Service Account Settings

If you choose “private_key” or “json_key” as auth_method, obtain service_account_email and the private key (p12_keyfile) or JSON key (json_keyfile) as follows.

  1. Create a project in the Google Developers Console.

  2. Create a “Service Account” in that project.

    A service account has two specific scopes: read-only and read-write.

    embulk-input-gcs can run with the “read-only” scope.

  3. Generate a private key in P12 (PKCS12) format or a JSON key, and upload it to the machine that runs Embulk.

Run

    java -jar embulk-X.Y.Z.jar run /path/to/config.yml

Configuration

  • bucket: Google Cloud Storage bucket name (string, required)
  • path_prefix: prefix of target keys (string, either “path_prefix” or “paths” is required)
  • paths: list of target keys (array of string, either “path_prefix” or “paths” is required)
  • path_match_pattern: regexp to match file paths. If a file path doesn’t match this pattern, the file is skipped (regexp string, optional)
  • incremental: enables incremental loading (boolean, optional, default: true). If incremental loading is enabled, the config diff for the next execution includes the last_path parameter so that the next execution skips files before that path. Otherwise, last_path is not included.
  • auth_method: authentication method (string, optional, one of “private_key”, “json_key” or “compute_engine”; default: “private_key”)
  • service_account_email: email address of the Google Cloud Storage service account (string, required when auth_method is private_key)
  • p12_keyfile: full path of the P12 key file (string, required when auth_method is private_key)
  • json_keyfile: full path of the JSON key file (string, required when auth_method is json_key)
  • application_name: any application name you like (string, optional)
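
As a rough illustration of how path_prefix, path_match_pattern and last_path could interact, the Python sketch below filters a list of object keys. It is not the plugin's implementation: select_paths is a hypothetical helper, and it assumes that keys are compared lexicographically, that the key equal to last_path is itself skipped, and that the pattern is applied as an unanchored search.

```python
import re


def select_paths(keys, path_prefix, path_match_pattern=None, last_path=None):
    """Return (selected keys, last_path for the next run).

    Hypothetical helper: the ordering and matching rules are assumptions,
    not taken from the plugin source.
    """
    pattern = re.compile(path_match_pattern) if path_match_pattern else None
    selected = []
    for key in sorted(keys):
        if not key.startswith(path_prefix):
            continue  # outside the configured prefix
        if last_path is not None and key <= last_path:
            continue  # already loaded by a previous incremental run
        if pattern is not None and not pattern.search(key):
            continue  # skipped by path_match_pattern
        selected.append(key)
    # With incremental loading, the last selected key becomes the next last_path.
    next_last_path = selected[-1] if selected else last_path
    return selected, next_last_path


keys = ["logs/csv-001.csv", "logs/csv-002.csv", "logs/csv-003.txt", "other/x.csv"]
first, last = select_paths(keys, "logs/csv-", r"\.csv$")
second, _ = select_paths(keys + ["logs/csv-004.csv"], "logs/csv-", r"\.csv$", last)
```

In this sketch the first run selects the two .csv files under the prefix, and the second run, given the returned last_path, selects only the newly added file.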

Example

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      bucket: my-gcs-bucket
      path_prefix: logs/csv-
      auth_method: private_key  # default
      service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
      p12_keyfile: /path/to/p12_keyfile.p12
      application_name: Anything you like

Example for “sample_01.csv.gz”, generated by embulk example:

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      bucket: my-gcs-bucket
      path_prefix: sample_
      auth_method: private_key  # default
      service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
      p12_keyfile: /path/to/p12_keyfile.p12
      application_name: Anything you like
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        header_line: true
        columns:
        - {name: id, type: long}
        - {name: account, type: long}
        - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        - {name: purchase, type: timestamp, format: '%Y%m%d'}
        - {name: comment, type: string}
    out: {type: stdout}

To skip files using regexp:

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      bucket: my-gcs-bucket
      path_prefix: logs/csv-
      # ...
      path_match_pattern: \.csv$  # a file is skipped if its path doesn't match this pattern
      ## some examples of regexp:
      #path_match_pattern: /archive/  # match files in .../archive/... directory
      #path_match_pattern: /data1/|/data2/  # match files in .../data1/... or .../data2/... directory
      #path_match_pattern: \.csv$|\.csv\.gz$  # match files whose suffix is .csv or .csv.gz
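
To check a pattern before putting it in config.yml, a small script can help. The sketch below uses Python's re.search as a stand-in for the plugin's matcher. This is an assumption: the plugin is written in Java and presumably uses Java's regex engine, and the sketch further assumes unanchored matching, as the /archive/ example suggests. The helper name matches and the sample paths are made up for illustration.

```python
import re


def matches(pattern, path):
    """True if the pattern is found anywhere in the path (assumed semantics)."""
    return re.search(pattern, path) is not None


paths = [
    "logs/csv-2015.csv",
    "logs/csv-2015.csv.gz",
    "logs/archive/old.csv",
    "logs/notcsv",
]

# Only paths ending in a literal ".csv" are kept.
kept = [p for p in paths if matches(r"\.csv$", p)]

# An unescaped "." matches any character, so ".csv$" also accepts "notcsv".
loose = matches(r".csv$", "logs/notcsv")
strict = matches(r"\.csv$|\.csv\.gz$", "logs/notcsv")
gz_ok = matches(r"\.csv$|\.csv\.gz$", "logs/csv-2015.csv.gz")
```

Escaping the dot, as in \.csv$, avoids accidentally matching paths that merely end in the letters "csv".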

Authentication

Three methods are supported for fetching an access token for the service account.

  1. Public-private key pair of a GCP (Google Cloud Platform) service account
  2. JSON key of a GCP (Google Cloud Platform) service account
  3. Pre-defined access token (Google Compute Engine only)

Public-Private key pair of GCP’s service account

You first need to create a service account (client ID), download its private key, and deploy the key with embulk.

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      auth_method: private_key
      service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
      p12_keyfile: /path/to/p12_keyfile.p12

JSON key of GCP’s service account

You first need to create a service account (client ID), download its JSON key, and deploy the key with embulk.

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      auth_method: json_key
      json_keyfile: /path/to/json_keyfile.json

You can also embed the contents of json_keyfile in config.yml.

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      auth_method: json_key
      json_keyfile:
        content: |
          {
            "private_key_id": "123456789",
            "private_key": "-----BEGIN PRIVATE KEY-----\nABCDEF",
            "client_email": "..."
          }
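
Indenting the key by hand under content: | is easy to get wrong, so the embedded block can be generated instead. The Python sketch below is one possible way to do that. embed_json_key is a hypothetical helper, the indentation widths are assumptions chosen to sit under an in: section, and the sample key is a placeholder rather than a real credential.

```python
import json


def embed_json_key(json_text, indent=6):
    """Render a json_keyfile entry with the key embedded as a YAML literal block."""
    # Parsing and re-serializing validates the JSON and normalizes its layout.
    pretty = json.dumps(json.loads(json_text), indent=2)
    pad = " " * indent
    body = "\n".join(pad + line for line in pretty.splitlines())
    return "  json_keyfile:\n    content: |\n" + body + "\n"


# Placeholder key contents; a real key file has more fields.
sample = '{"private_key_id": "123456789", "client_email": "..."}'
snippet = embed_json_key(sample)
print(snippet)
```

Every line of the key must be indented deeper than content: for YAML to treat it as part of the literal block, which is what the pad prefix ensures.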

Pre-defined access token (GCE only)

You don’t need to explicitly create a service account for embulk when you run embulk on Google Compute Engine. For this third authentication method, add the API scope “https://www.googleapis.com/auth/devstorage.read_only” to the scope list of your Compute Engine VM instance (see “Setting the scope of service account access for instances”), then configure embulk like this:

    in:
      type:
        source: maven
        group: org.embulk
        name: gcs
        version: "0.5.0"
      auth_method: compute_engine

Eventual Consistency

Listing objects is eventually consistent, although getting objects is strongly consistent; see https://cloud.google.com/storage/docs/consistency.

path_prefix uses the objects list API, so it may miss some objects.
To avoid this, use the paths option, which specifies object paths directly without going through the objects list API.

For Maintainers

Build

    ./gradlew jar

Test

To run the unit tests, set the following environment variables. The test data files must also be uploaded to an existing GCS bucket. When the environment variables are not set, some test cases are skipped.

    GCP_EMAIL
    GCP_P12_KEYFILE
    GCP_JSON_KEYFILE
    GCP_BUCKET
    GCP_BUCKET_DIRECTORY (optional, if needed)
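
Because tests are skipped rather than failed when variables are missing, it can be useful to check the environment before a test run. The Python sketch below is one way to do that. missing_variables is a hypothetical helper and is not part of the project's build; treating every variable except GCP_BUCKET_DIRECTORY as required is an assumption based on the list above.

```python
import os

# Variables listed above; only GCP_BUCKET_DIRECTORY is marked optional.
REQUIRED = ["GCP_EMAIL", "GCP_P12_KEYFILE", "GCP_JSON_KEYFILE", "GCP_BUCKET"]


def missing_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Some GCS test cases will be skipped; unset:", ", ".join(missing))
    else:
        print("All required GCS test variables are set.")
```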

If you use Mac OS X El Capitan and GUI applications (such as an IDE), set the variables as follows.

    $ vi ~/Library/LaunchAgents/environment.plist
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
      <key>Label</key>
      <string>my.startup</string>
      <key>ProgramArguments</key>
      <array>
        <string>sh</string>
        <string>-c</string>
        <string>
        launchctl setenv GCP_EMAIL ABCXYZ123ABCXYZ123.gserviceaccount.com
        launchctl setenv GCP_P12_KEYFILE /path/to/p12_keyfile.p12
        launchctl setenv GCP_JSON_KEYFILE /path/to/json_keyfile.json
        launchctl setenv GCP_BUCKET my-bucket
        launchctl setenv GCP_BUCKET_DIRECTORY unittests
        </string>
      </array>
      <key>RunAtLoad</key>
      <true></true>
    </dict>
    </plist>
    $ launchctl load ~/Library/LaunchAgents/environment.plist
    $ launchctl getenv GCP_EMAIL  # check that the value is set

Then start your applications.

Release

Modify version in build.gradle at a detached commit, and then tag the commit with an annotation.

    git checkout --detach master
    (Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)
    git add build.gradle
    git commit -m "Release vX.Y.Z"
    git tag -a vX.Y.Z
    (Edit: Write a tag annotation in the changelog format.)

See Keep a Changelog for the changelog format. We adopt a part of it for Git’s tag annotation like below.

    ## [X.Y.Z] - YYYY-MM-DD
    ### Added
    - Added a feature.
    ### Changed
    - Changed something.
    ### Fixed
    - Fixed a bug.

Then push the annotated tag. This triggers a release operation on GitHub Actions after approval.

    git push -u origin vX.Y.Z