Project author: treasure-data

Project description:
Hive Japanese NLP UDFs with NEologd
Language: Java
Repository: git://github.com/treasure-data/hive-udf-neologd.git
Created: 2018-05-09T03:59:29Z
Project page: https://github.com/treasure-data/hive-udf-neologd

License: Apache License 2.0


Hive Japanese NLP UDFs with NEologd

This package extends Hivemall's Japanese NLP capability by utilizing NEologd.

Before getting started, build the latest version of hivemall-all-{HIVEMALL_VERSION}.jar as documented in the Hivemall installation guide.

Usage

Run the build script:

  ./build.sh

The build script is a modified version of kazuhira-r/kuromoji-with-mecab-neologd-buildscript.

Use the UDFs on Hive:

  add jar hivemall-all-{HIVEMALL_VERSION}.jar; -- e.g., hivemall-all-0.5.1-incubating-SNAPSHOT.jar
  add jar hive-udf-neologd-{VERSION}-{NEOLOGD_VERSION_DATE}.jar; -- e.g., hive-udf-neologd-0.1.0-20180524.jar
  create temporary function tokenize_ja_neologd as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';
  select tokenize_ja_neologd();
  -- ["{VERSION}-{NEOLOGD_VERSION_DATE}"]
  select tokenize_ja_neologd('10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。');
  -- ["10日","放送","中居正広の身になる図書館","テレビ朝日","系","smap","中居正広","篠原信一","過去","勘違い","明かす","一幕"]
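
Since the function returns an array of tokens, it can presumably be combined with Hive's standard LATERAL VIEW and explode to work on a table column. The following is only a sketch under assumptions: the table `docs` and its columns `docid` and `body` are hypothetical and do not come from this project, and the function must already be registered in the same session as shown above.

```sql
-- Hypothetical input table (not part of this project):
--   docs(docid INT, body STRING)
-- Assumes tokenize_ja_neologd was registered earlier in this session.
SELECT
  t.word,
  COUNT(1) AS freq
FROM
  docs
  -- explode() turns the token array into one row per token
  LATERAL VIEW explode(tokenize_ja_neologd(body)) t AS word
GROUP BY
  t.word
ORDER BY
  freq DESC
LIMIT 20;
```

This lists the 20 most frequent tokens across all rows of the hypothetical table. Because the function is created as a temporary function, it has to be registered again in each new Hive session.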