Project author: DimasDMM

Project description:
Pyramid is a novel layered model for Nested Named Entity Recognition (nested NER). This code is based on the paper *Pyramid: A Layered Model for Nested Named Entity Recognition* by Jue Wang et al.
Primary language: Jupyter Notebook
Project URL: git://github.com/DimasDMM/pyramid.git
Created: 2020-12-09T23:02:37Z
Project community: https://github.com/DimasDMM/pyramid


Pyramid

Introduction

Pyramid is a novel layered model for Nested Named Entity Recognition (nested NER). This code is based on the paper Pyramid: A Layered Model for Nested Named Entity Recognition by Jue Wang et al.

./images/pyramid-example.jpg

Note that this code is based on my own understanding of the paper. Nevertheless, the authors have released the official code of the paper at https://github.com/LorrinWWW/Pyramid/.

This repository also contains a step-by-step walkthrough in the notebooks located in the folder notebooks.
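
In the paper's design, the l-th decoding layer of the pyramid handles mention candidates of exactly l consecutive tokens, so a pyramid with total_layers layers covers nested entities up to that length. A minimal sketch of this span-per-layer enumeration (illustration only, not code from this repository):

  # Illustration of the span-per-layer idea behind the pyramid (not part of this repository).
  # Layer l enumerates candidate spans of l consecutive tokens.
  tokens = ["IL-2", "gene", "expression"]
  total_layers = 3  # the training command below uses 16

  for layer in range(1, total_layers + 1):
      for start in range(len(tokens) - layer + 1):
          end = start + layer
          print(f"layer {layer}: span [{start}, {end}) -> {' '.join(tokens[start:end])}")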

Set up

Clone this repository, create default folders and install dependencies:

  git clone https://github.com/DimasDMM/pyramid.git
  cd pyramid
  mkdir data
  mkdir artifacts
  pip install -r requirements.txt

Download GloVe embeddings:

  cd data
  wget http://nlp.stanford.edu/data/glove.6B.zip --no-check-certificate
  unzip glove.6B.zip
  cd ..
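
The training script reads these embeddings through the --wv_file parameter (see the Commands section). If you want to sanity-check the download, a minimal sketch for loading glove.6B.200d.txt into a dictionary (illustration only, not part of this repository):

  # Quick sanity check of the downloaded GloVe file (illustration only).
  # Each line is: <word> <value_1> ... <value_200>
  import numpy as np

  embeddings = {}
  with open("data/glove.6B.200d.txt", encoding="utf-8") as f:
      for line in f:
          parts = line.rstrip().split(" ")
          embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

  print(len(embeddings), "words loaded;", embeddings["the"].shape)  # expected: (200,)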

You also need to download the tokenizer and the pretrained LM* beforehand:

  python run_download_lm.py --lm_name dmis-lab/biobert-v1.1

*Feel free to use any pretrained model from HuggingFace: https://huggingface.co/models
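
If you prefer to fetch the model manually, the download step presumably boils down to something like the following sketch with the transformers library (the actual run_download_lm.py may differ, and the save path under ./artifacts is an assumption based on the Parameters section below):

  # Hypothetical equivalent of run_download_lm.py (sketch only; the real script may differ).
  from transformers import AutoModel, AutoTokenizer

  lm_name = "dmis-lab/biobert-v1.1"
  save_dir = "./artifacts/" + lm_name.replace("/", "-")  # assumed directory layout

  tokenizer = AutoTokenizer.from_pretrained(lm_name)
  model = AutoModel.from_pretrained(lm_name)

  tokenizer.save_pretrained(save_dir)
  model.save_pretrained(save_dir)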

Dataset

GENIA is the dataset on which I have tested this repository. You can download and prepare it with these commands:

  cd data
  wget http://www.nactem.ac.uk/GENIA/current/GENIA-corpus/Term/GENIAcorpus3.02.tgz --no-check-certificate
  mkdir GENIA
  tar -xvf GENIAcorpus3.02.tgz -C GENIA
  cd ..
  python run_preprocess.py \
    --dataset genia \
    --raw_filepath "./data/GENIA/GENIA_term_3.02/GENIAcorpus3.02.xml" \
    --lm_name dmis-lab/biobert-v1.1 \
    --cased 0

If you want to use a different dataset, it must be a JSON file as follows:

  {
    "tokens": ["token0", "token1", "token2"],
    "entities": [
      {
        "entity_type": "PER",
        "span": [0, 1]
      },
      {
        "entity_type": "ORG",
        "span": [2, 3]
      }
    ]
  }
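
A minimal sketch (not part of this repository) of writing one sample in this format and reading it back. The file name train.mydataset.json is hypothetical, and the sketch assumes the file holds a list of such sentence objects; inspect the output of run_preprocess.py to confirm the exact layout and the naming convention described in the Parameters section:

  # Sketch for building a custom dataset file (illustration only).
  # Assumption: the file is a JSON list of sentence objects like the one above.
  import json

  sample = {
      "tokens": ["token0", "token1", "token2"],
      "entities": [
          {"entity_type": "PER", "span": [0, 1]},
          {"entity_type": "ORG", "span": [2, 3]},
      ],
  }

  with open("data/train.mydataset.json", "w", encoding="utf-8") as f:
      json.dump([sample], f, ensure_ascii=False, indent=2)

  with open("data/train.mydataset.json", encoding="utf-8") as f:
      for s in json.load(f):
          print(s["tokens"], [(e["entity_type"], e["span"]) for e in s["entities"]])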

Commands

Fine-tune model:

  python run_training.py \
    --model_ckpt ./artifacts/genia/ \
    --wv_file ./data/glove.6B.200d.txt \
    --use_label_embeddings 0 \
    --use_char_encoder 1 \
    --dataset genia \
    --max_epoches 500 \
    --max_steps 1e9 \
    --total_layers 16 \
    --batch_size 64 \
    --token_emb_dim 200 \
    --char_emb_dim 100 \
    --cased_lm 0 \
    --cased_word 0 \
    --cased_char 0 \
    --hidden_dim 100 \
    --dropout 0.45 \
    --lm_name dmis-lab/biobert-large-cased-v1.1 \
    --lm_emb_dim 1024 \
    --device cuda \
    --continue_training 0 \
    --log_to_file logger_genia.txt

Once the model is fine-tuned, run the evaluation script:

  python run_evaluator.py \
    --model_ckpt ./artifacts/genia/ \
    --dataset genia \
    --device cuda

Parameters

The available parameters are the following:

  • device: Device to use: cpu or cuda.
  • model_ckpt: Path to store the model.
  • wv_file: (Optional, default=None) Path to the file with word embeddings. If not provided, the Word Encoder described in the paper won't be used.
  • use_label_embeddings: (Optional, default=0) Uses a label embedding layer on top of the model.
  • use_char_encoder: (Optional, default=1) Uses the Char Encoder described in the paper.
  • dataset: Name of the dataset to use. The dataset files must be located in the folder ./data with the names train.<dataset>.json, valid.<dataset>.json and test.<dataset>.json for the train, validation and test datasets respectively.
  • max_epoches: (Optional, default=500) Maximum number of epochs for training.
  • max_steps: (Optional, default=1e9) Maximum number of steps for training.
  • total_layers: (Optional, default=16) Number of layers in the pyramid.
  • batch_size: (Optional, default=64) Batch size for training.
  • token_emb_dim: (Optional, default=100) Dimension of token embeddings.
  • char_emb_dim: (Optional, default=100) Dimension of char embeddings.
  • cased_lm: (Optional, default=1) Use cased LM Encoder.
  • cased_word: (Optional, default=1) Use cased Word Encoder.
  • cased_char: (Optional, default=1) Use cased Char Encoder.
  • hidden_dim: (Optional, default=100) Hidden dimension of the LSTM layers in the pyramid. Since the LSTM layers are bidirectional, the actual hidden dimension will be twice this value (see the sketch after this list).
  • dropout: (Optional, default=0.45) Dropout rate.
  • lm_name: (Optional, default=dmis-lab/biobert-large-cased-v1.1) Pretrained language model from Hugging Face. The model must already be downloaded in the folder ./artifacts (use the script run_download_lm.py to download and store it).
  • lm_emb_dim: (Optional, default=1024) Hidden dimension of the language model.
  • continue_training: (Optional, default=0) To avoid overwriting a trained model, set this flag to 1 to continue training a model from a checkpoint. If the model already exists and the flag is 0, the script will throw an error.
  • log_to_file: (Optional, default=None) File to store the standard output.
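
For example, the doubling of hidden_dim mentioned above can be checked with a toy bidirectional LSTM (a generic PyTorch sketch, not code from this repository):

  # Toy check of the bidirectional doubling of hidden_dim (generic PyTorch, illustration only).
  import torch
  import torch.nn as nn

  hidden_dim = 100
  lstm = nn.LSTM(input_size=200, hidden_size=hidden_dim, bidirectional=True, batch_first=True)

  x = torch.randn(1, 7, 200)  # (batch, sequence length, token embedding dim)
  output, _ = lstm(x)
  print(output.shape)  # torch.Size([1, 7, 200]) -> last dim is 2 * hidden_dim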

Additional comments

This repository includes sbatch files to run the scripts with Slurm. See: https://slurm.schedmd.com/.

To do

  • Add option to load pretrained label embeddings for training.

Have fun! ᕙ (° ~ ° ~)

./images/pyramid-pie.jpg