Project author: yu-iskw

Project description:
Tokenize Japanese text on BigQuery with Kuromoji in Apache Beam/Google Dataflow at scale
Language: Java
Repository: git://github.com/yu-iskw/kuromoji-for-bigquery.git
Created: 2017-11-29T04:18:28Z
Project page: https://github.com/yu-iskw/kuromoji-for-bigquery


kuromoji-for-bigquery

kuromoji-for-bigquery tokenizes the text in a column of a BigQuery table with Kuromoji on Apache Beam and stores the tokenized result in another BigQuery table.

It scales horizontally on a distributed system, because Apache Beam pipelines can run on Google Dataflow, Apache Spark, Apache Flink, and other runners.

Overview

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Version Info

  • Apache Beam: 2.42.0
  • Kuromoji: 0.7.7

How to Use

Command Line Options

Required Options

  • --project: Google Cloud Project
  • --inputDataset: Input BigQuery dataset ID
  • --inputTable: Input BigQuery table ID
  • --tokenizedColumn: Name of the column to tokenize in the input table
  • --outputDataset: Output BigQuery dataset ID
  • --outputTable: Output BigQuery table ID
  • --schema: BigQuery schema of the columns to select from the input table. (Format: id:integer,name:string,value:float,ts:timestamp)
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.
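
The --schema format listed above (comma-separated name:type pairs) can be illustrated with a small parser. This is only a sketch of the format, not the project's actual parsing code: the class name SchemaSpec and the choice to upper-case type names are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of kuromoji-for-bigquery.
final class SchemaSpec {
    private SchemaSpec() {}

    /** Parses "id:integer,name:string" into an ordered column-name -> type map. */
    static Map<String, String> parse(String spec) {
        if (spec == null || spec.trim().isEmpty()) {
            throw new IllegalArgumentException("schema must not be empty");
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (String field : spec.split(",")) {
            String[] parts = field.trim().split(":");
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("expected name:type but got '" + field + "'");
            }
            String name = parts[0].trim();
            // Assumption: type names are normalized to upper case for readability.
            String type = parts[1].trim().toUpperCase(Locale.ROOT);
            if (columns.put(name, type) != null) {
                throw new IllegalArgumentException("duplicate column: " + name);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Uses the example format string from the option description.
        System.out.println(parse("id:integer,name:string,value:float,ts:timestamp"));
    }
}
```

The map keeps the columns in the order they were written, which makes it easy to check that the column named by --tokenizedColumn is among the selected ones before launching a job.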

Optional Options

  • --outputColumn: Output column for the tokenized result in the output table. (Default: token)
  • --kuromojiMode: Kuromoji Mode. (NORMAL, SEARCH, or EXTENDED) (Default: NORMAL)
  • --createDisposition: Create Disposition option for BigQuery. (CREATE_NEVER or CREATE_IF_NEEDED)
  • --writeDisposition: Write Disposition option for BigQuery. (WRITE_TRUNCATE, WRITE_APPEND or WRITE_EMPTY)
  • --runner: Apache Beam runner.
    • If this option is not set, the pipeline runs on your local machine rather than on Google Dataflow.
    • e.g. DataflowRunner
  • --numWorkers: Number of workers to use when running on Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4
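
As described above, --kuromojiMode accepts NORMAL, SEARCH, or EXTENDED and defaults to NORMAL. The sketch below shows one way such a value could be validated before a job is started. The ModeOption class and its local Mode enum are hypothetical stand-ins written for this example, and accepting lower-case input is an assumption rather than documented behavior.

```java
import java.util.Locale;

// Hypothetical validator for illustration; not part of kuromoji-for-bigquery.
final class ModeOption {
    // Local stand-in that mirrors the three documented values.
    enum Mode { NORMAL, SEARCH, EXTENDED }

    private ModeOption() {}

    /** Returns the parsed mode, or NORMAL (the documented default) when unset. */
    static Mode parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Mode.NORMAL;
        }
        try {
            // Assumption: input is trimmed and matched case-insensitively.
            return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "kuromojiMode must be NORMAL, SEARCH, or EXTENDED but was '" + raw + "'");
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(args.length > 0 ? args[0] : null));
    }
}
```

Failing fast on an unknown mode on the client side avoids submitting a distributed job that would only fail once workers start.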

Run the command

  # Compile
  mvn clean package
  # Run kuromoji-for-bigquery via the compiled JAR file
  java -jar $(pwd)/target/kuromoji-for-bigquery-bundled-0.4.1.jar \
    --project=test-project-id \
    --schema=id:integer \
    --inputDataset=test_input_dataset \
    --inputTable=test_input_table \
    --outputDataset=test_output_dataset \
    --outputTable=test_output_table \
    --tokenizedColumn=text \
    --outputColumn=token \
    --kuromojiMode=NORMAL \
    --tempLocation=gs://test_yu/test-log/ \
    --gcpTempLocation=gs://test_yu/test-log/ \
    --maxNumWorkers=10 \
    --workerMachineType=n1-standard-2
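
If the job needs to be launched from another Java program rather than from a shell, the same command line can be assembled from a map of options. The following is a minimal sketch under that assumption: LaunchArgs is a hypothetical helper, and the option names and values are simply the ones from the command above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of kuromoji-for-bigquery.
final class LaunchArgs {
    private LaunchArgs() {}

    /** Builds "java -jar <jar> --key=value ..." in the iteration order of the map. */
    static List<String> build(String jarPath, Map<String, String> options) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        for (Map.Entry<String, String> option : options.entrySet()) {
            command.add("--" + option.getKey() + "=" + option.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so the flags appear as listed.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("project", "test-project-id");
        options.put("schema", "id:integer");
        options.put("inputDataset", "test_input_dataset");
        options.put("inputTable", "test_input_table");
        options.put("outputDataset", "test_output_dataset");
        options.put("outputTable", "test_output_table");
        options.put("tokenizedColumn", "text");
        System.out.println(String.join(" ",
            build("target/kuromoji-for-bigquery-bundled-0.4.1.jar", options)));
    }
}
```

The returned list can be passed to `new ProcessBuilder(command)` to start the job as a child process; passing arguments as a list avoids shell quoting issues.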

Versions

kuromoji-for-bigquery   Apache Beam   kuromoji
0.1.0                   2.1.0         0.7.7
0.2.x                   2.20.0        0.7.7
0.3.x                   2.34.0        0.7.7
0.4.x                   2.42.0        0.7.7

License

Copyright (c) 2017 Yu Ishikawa.