Project author: chakki-works

Project description:
Character Based Named Entity Recognition.
Language: Python
Project URL: git://github.com/chakki-works/namaco.git
Created: 2017-10-11T00:24:21Z
Project community: https://github.com/chakki-works/namaco

namaco

namaco is a library for character-based Named Entity Recognition.
It focuses especially on Japanese and Chinese named entity recognition.

Demo

The following demo shows Chinese Named Entity Recognition:

[Animated GIF: Chinese named entity recognition demo]

Feature Support

namaco provides the following features:

  • Training a model on your own data.
  • Tagging sentences with a trained model.

Install

To install namaco, simply run:

    $ pip install namaco

Data format

The data must be in tab-separated (TSV) format, with one character and its tag per line. Only the tag column is shown here:

    B-PERSON
    E-PERSON
    O
    O
    O
    O
    S-LOC
    O
    O
    B-DATE
    E-DATE
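
As a rough illustration of this layout, the sketch below writes and re-reads a small file. It assumes each line holds one character and its tag separated by a tab, and that sentences are separated by a blank line. The `read_tsv` helper is hypothetical and is not namaco's own `load_data_and_labels`. Pairing the example sentence from the tagging section with the first nine tags above is also an assumption.

```python
import os
import tempfile


def read_tsv(path):
    """Hypothetical reader: one 'char<TAB>tag' pair per line, blank line between sentences."""
    sentences, labels = [], []
    chars, tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                # A blank line closes the current sentence (assumed convention).
                if chars:
                    sentences.append(chars)
                    labels.append(tags)
                    chars, tags = [], []
                continue
            char, tag = line.split('\t')
            chars.append(char)
            tags.append(tag)
    if chars:
        sentences.append(chars)
        labels.append(tags)
    return sentences, labels


# Assumed sample: the example sentence paired with the first nine tags.
sample = ('安\tB-PERSON\n倍\tE-PERSON\n首\tO\n相\tO\nが\tO\n'
          '訪\tO\n米\tS-LOC\nし\tO\nた\tO\n')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    x, y = read_tsv(path)
```

Here `x` holds one list of characters per sentence and `y` the matching list of tags, which is the shape the training code later in this README appears to expect.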

Get Started

Import

First, import the necessary modules:

    import os
    import namaco
    from namaco.data.reader import load_data_and_labels
    from namaco.data.preprocess import prepare_preprocessor
    from namaco.config import ModelConfig, TrainingConfig
    from namaco.models import CharNER

These cover data loading, preprocessing, configuration, and the model.

Then, set the parameters to use later:

    DATA_ROOT = 'data/ja/ner'
    SAVE_ROOT = './models'  # trained model
    LOG_ROOT = './logs'     # checkpoint, tensorboard
    model_file = os.path.join(SAVE_ROOT, 'model.h5')
    model_config = ModelConfig()
    training_config = TrainingConfig()

Loading data

After importing the modules, read data for training and validation:

    train_path = os.path.join(DATA_ROOT, 'train.txt')
    valid_path = os.path.join(DATA_ROOT, 'valid.txt')
    x_train, y_train = load_data_and_labels(train_path)
    x_valid, y_valid = load_data_and_labels(valid_path)

After reading the data, prepare the preprocessor and the model:

    p = prepare_preprocessor(x_train, y_train)
    model = CharNER(model_config, p.vocab_size(), p.tag_size())
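
To give a feel for what the two sizes passed to CharNER represent, here is a minimal sketch of a character vocabulary and a tag index built from training data. The `build_index` helper and the reserved `<PAD>` and `<UNK>` entries are assumptions for illustration; namaco's actual preprocessor may be organized differently.

```python
def build_index(sequences, reserved=()):
    """Map each distinct item to an integer id, with reserved entries first."""
    index = {token: i for i, token in enumerate(reserved)}
    for seq in sequences:
        for item in seq:
            if item not in index:
                index[item] = len(index)
    return index


# Toy training data: one sentence, one tag per character.
x_train = [list('安倍首相が訪米した')]
y_train = [['B-PERSON', 'E-PERSON', 'O', 'O', 'O', 'O', 'S-LOC', 'O', 'O']]

char_index = build_index(x_train, reserved=('<PAD>', '<UNK>'))
tag_index = build_index(y_train, reserved=('<PAD>',))

# Characters not seen during training fall back to the <UNK> id.
encoded = [char_index.get(c, char_index['<UNK>']) for c in '安倍が来日']

vocab_size, tag_size = len(char_index), len(tag_index)
```

In this sketch `vocab_size` counts the distinct characters plus the two reserved entries, and `tag_size` counts the distinct tags plus padding.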

Now we are ready for training :)

Training a model

Let's train a model using Trainer, which manages the whole training process.
Create a Trainer instance and pass the training and validation data to its train method:

    trainer = namaco.Trainer(model,
                             model.loss,
                             training_config,
                             log_dir=LOG_ROOT,
                             save_path=model_file,
                             preprocessor=p)
    trainer.train(x_train, y_train, x_valid, y_valid)

If training is progressing normally, a progress bar is displayed as follows:

    ...
    Epoch 3/15
    702/703 [============================>.] - ETA: 0s - loss: 60.0129 - f1: 89.70
    703/703 [==============================] - 319s - loss: 59.9278
    Epoch 4/15
    702/703 [============================>.] - ETA: 0s - loss: 59.9268 - f1: 90.03
    703/703 [==============================] - 324s - loss: 59.8417
    Epoch 5/15
    702/703 [============================>.] - ETA: 0s - loss: 58.9831 - f1: 90.67
    703/703 [==============================] - 297s - loss: 58.8993
    ...
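
The log does not say how the f1 value is computed. One common choice for NER is entity-level F1 over exact span matches, reported here as a percentage to mirror the log. The sketch below assumes that definition and uses a hypothetical `entity_f1` helper, so it is not necessarily what Trainer does.

```python
def entity_f1(gold, pred):
    """Entity-level F1 (in percent) over sets of (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)  # spans that match exactly in offsets and type
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)


gold = {(0, 2, 'PERSON'), (6, 7, 'LOC')}
pred = {(0, 2, 'PERSON'), (5, 7, 'LOC')}  # second span has a wrong boundary
score = entity_f1(gold, pred)
```

Under this definition a span with a wrong boundary counts as both a false positive and a false negative, which is why a single boundary error costs more than a single wrong character tag would.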

Tagging a sentence

We can use Tagger to tag text.
Create a Tagger instance and pass text to its analyze method:

    tagger = namaco.Tagger(model_file, preprocessor=p, tokenizer=list)

Let's try tagging the sentence 安倍首相が訪米した ("Prime Minister Abe visited the United States").
We can do it as follows:

    >>> sent = '安倍首相が訪米した'
    >>> tagger.analyze(sent)
    {
        "language": "jp",
        "text": "安倍首相が訪米した",
        "entities": [
            {
                "text": "安倍",
                "type": "Person",
                "score": 0.972231,
                "beginOffset": 0,
                "endOffset": 2
            },
            {
                "text": "米",
                "type": "Location",
                "score": 0.941431,
                "beginOffset": 6,
                "endOffset": 7
            }
        ]
    }
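
The offsets in this output line up with character-level tags in the scheme shown under Data format (B-, E-, S- and O). As a sketch of how such tags could be turned into entity spans, the hypothetical `tags_to_entities` helper below also assumes an I- prefix for characters inside longer entities. It is an illustration, not the Tagger's implementation, and it omits scores and the mapping to display names such as "Person".

```python
def tags_to_entities(text, tags):
    """Collect entities from per-character tags (B-/I-/E-/S- prefixes, O for outside)."""
    entities, start, current = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'S':
            # Single-character entity.
            entities.append({'text': text[i], 'type': label,
                             'beginOffset': i, 'endOffset': i + 1})
            start, current = None, None
        elif prefix == 'B':
            # Open a new multi-character entity.
            start, current = i, label
        elif prefix == 'E' and start is not None and label == current:
            # Close the open entity; endOffset is exclusive.
            entities.append({'text': text[start:i + 1], 'type': label,
                             'beginOffset': start, 'endOffset': i + 1})
            start, current = None, None
        elif prefix == 'I' and start is not None and label == current:
            continue
        else:
            # 'O' or an inconsistent tag: drop any partially built entity.
            start, current = None, None
    return entities


sent = '安倍首相が訪米した'
tags = ['B-PERSON', 'E-PERSON', 'O', 'O', 'O', 'O', 'S-LOC', 'O', 'O']
entities = tags_to_entities(sent, tags)
```

With these inputs the helper yields two entities, 安倍 at offsets 0 to 2 and 米 at offsets 6 to 7, matching the offsets in the output above.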