Project author: po3rin

Project description:
kuro2sudachi lets you convert a kuromoji user dictionary into a Sudachi user dictionary.
Language: Python
Project URL: git://github.com/po3rin/kuro2sudachi.git
Created: 2021-01-18T13:33:03Z
Project community: https://github.com/po3rin/kuro2sudachi

License: Apache License 2.0


kuro2sudachi

kuro2sudachi lets you convert a kuromoji user dictionary into a Sudachi user dictionary.

Usage

    $ pip install kuro2sudachi
    $ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt

Custom pos convert dict

You can override the default conversion config by supplying a JSON settings file.

    {
      "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000
      },
      "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000
      }
    }

    $ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json

To ignore unsupported-POS and invalid-format errors, use the --ignore flag.
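
As a rough illustration of what such a config drives, here is a minimal Python sketch that maps one kuromoji entry to an 18-column Sudachi entry. It is not kuro2sudachi's actual implementation: the names `convert_line`, `convert_lines`, and `UnsupportedPosError` are hypothetical, the reading handling is simplified, and the `ignore` parameter only imitates the behavior described for --ignore.

```python
# Same shape as the JSON config shown above.
CONFIG = {
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786, "right_id": 4786, "cost": 5000,
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146, "right_id": 5146, "cost": 5000,
    },
}


class UnsupportedPosError(ValueError):
    """Hypothetical error for a POS tag that has no entry in the config."""


def convert_line(line, config):
    """Map one kuromoji entry to one 18-column Sudachi entry (sketch only)."""
    fields = line.strip().split(",")
    if len(fields) != 4:
        raise ValueError(f"invalid format: {line!r}")
    surface, _segmentation, reading, pos = fields
    if pos not in config:
        raise UnsupportedPosError(f"unsupported pos: {pos}")
    rule = config[pos]
    columns = [
        surface,                    # headword used for lookup
        str(rule["left_id"]),
        str(rule["right_id"]),
        str(rule["cost"]),
        surface,                    # headword shown in results
        rule["sudachi_pos"],        # expands to six POS columns
        reading.replace(" ", ""),   # simplified reading handling (assumption)
        surface,                    # normalized form
        "*", "*", "*", "*", "*",    # dictionary form, split type, A/B units, unused
    ]
    return ",".join(columns)


def convert_lines(lines, config, ignore=False):
    """Convert many entries; with ignore=True, bad entries are skipped."""
    for line in lines:
        try:
            yield convert_line(line, config)
        except ValueError:
            if not ignore:
                raise


print(convert_line("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,固有名詞", CONFIG))
```

Making the unsupported-POS error a subclass of `ValueError` lets one `except` clause cover both failure cases named above, which keeps the skip logic in a single place.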

Dictionary type

You can choose the Sudachi dictionary used for tokenization with the -s option (default: core).

    $ pip install sudachidict_full
    $ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full
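
If you drive the CLI from a script, the options seen so far can be assembled programmatically. The sketch below is only a convenience wrapper: `build_command` is a hypothetical helper, it uses only the flags shown in this README (-o, -c, -s, --ignore), and it assumes they can be freely combined.

```python
import shlex


def build_command(src, out, config=None, dict_type=None, ignore=False):
    """Build the kuro2sudachi argument list from the options in this README."""
    cmd = ["kuro2sudachi", src, "-o", out]
    if config is not None:
        cmd += ["-c", config]      # custom POS convert config (JSON)
    if dict_type is not None:
        cmd += ["-s", dict_type]   # dictionary type, e.g. "full"
    if ignore:
        cmd.append("--ignore")     # skip unsupported POS / invalid lines
    return cmd


cmd = build_command("kuromoji_dict.txt", "sudachi_user_dict.txt", dict_type="full")
print(shlex.join(cmd))

# To actually run it (requires kuro2sudachi to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```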

Auto Splitting

kuro2sudachi supports auto splitting. Set split_mode and unit_div_mode in the convert config:

    {
      "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
        "split_mode": "C",
        "unit_div_mode": [
          "A", "B"
        ]
      }
    }

The output includes unit division info.

    $ cat kuromoji_dict.txt
    融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
    発作性心房細動,発作性心房細動,発作性心房細動,名詞
    $ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore
    $ cat sudachi_user_dict.txt
    融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
    発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*
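
To inspect the division info in the generated file, the relevant columns can be pulled out with a few lines of Python. This sketch assumes the standard 18-column Sudachi user dictionary layout, where the 15th column is the split type and the 16th and 17th hold the A-unit and B-unit split info as `/`-separated references; `parse_split_info` is a hypothetical helper name.

```python
def parse_split_info(line):
    """Extract split type and A/B unit split info from one Sudachi entry."""
    cols = line.strip().split(",")
    if len(cols) != 18:
        raise ValueError(f"expected 18 columns, got {len(cols)}")

    def refs(field):
        # "*" means no split info; otherwise references are "/"-separated.
        return [] if field == "*" else field.split("/")

    return {
        "surface": cols[0],
        "split_type": cols[14],
        "a_unit": refs(cols[15]),
        "b_unit": refs(cols[16]),
    }


# The two output lines from the example above.
LINE_1 = "融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*"
LINE_2 = "発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*"

for line in (LINE_1, LINE_2):
    info = parse_split_info(line)
    print(info["surface"], info["split_type"], info["a_unit"], info["b_unit"])
```

The references are kept as strings rather than converted to integers, since the sketch makes no assumption about what forms a split reference may take beyond those in the sample output.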

Splitting Words defined by kuromoji

Currently, the CLI does not support word splitting defined by kuromoji. Therefore, the split representation of kuromoji is ignored.

    中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
    中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
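
The effect on the entry above can be mimicked in Python: the space-separated segmentation is dropped and the per-segment readings are joined into one reading. This is only a sketch of the observable result, not the tool's code, and `flatten_kuromoji_entry` is a hypothetical name.

```python
def flatten_kuromoji_entry(line):
    """Show which kuromoji split info is dropped and how the reading is joined."""
    surface, segmentation, readings, pos = line.strip().split(",")
    segments = segmentation.split(" ")
    segment_readings = readings.split(" ")
    return {
        "surface": surface,
        # Per-segment pairs defined by kuromoji; not carried into the output.
        "dropped_segments": list(zip(segments, segment_readings)),
        # Reading as it appears in the Sudachi entry: segments concatenated.
        "reading": "".join(segment_readings),
        "pos": pos,
    }


entry = flatten_kuromoji_entry("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞")
print(entry["reading"])
print(entry["dropped_segments"])
```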

For Developers

Test kuro2sudachi

    $ poetry install
    $ poetry run pytest

Run the kuro2sudachi command

    $ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt