Project author: EMBEDDIA

Project description: Stacked-Transformers Named Entity Recognition

Language: Python

Repository: git://github.com/EMBEDDIA/stacked-ner.git

Created: 2021-01-28T14:06:40Z

Project page: https://github.com/EMBEDDIA/stacked-ner

License: MIT License


Run the code

BERT models need to be downloaded first (with the exception of CamemBERT).

Training:

```shell
CUDA_VISIBLE_DEVICES=1,2,3 python main.py \
    --directory TEMP_MODEL \
    --pre_trained_model PRETRAINED_MODEL_NAME \
    --train_dataset train.tsv \
    --test_dataset test.tsv \
    --dev_dataset valid.tsv \
    --batch_size 4 \
    --do_train \
    --no_cpu 5 \
    --language french \
    --model stacked \
    --num_layers 2
```

- `--directory`: path where the model is saved; predictions on the test/dev sets are automatically written here at the end of training
- `--pre_trained_model`: name of the pre-trained model, e.g. `bert-base-cased`
- `--language`: `french` for CamemBERT; `english` for the other models
- `--model`: `stacked` or `bert`
- `--num_layers`: number of stacked Transformer layers (here, 2)
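Since the trainer takes a long list of flags, the invocation can also be assembled programmatically, e.g. when sweeping hyperparameters. A minimal stdlib sketch using only the flags listed above; the `build_train_cmd` helper is illustrative, not part of the repository:

```python
import shlex

def build_train_cmd(model_dir, pretrained="bert-base-cased",
                    batch_size=4, num_layers=2,
                    language="english", model="stacked"):
    """Assemble the main.py training command as an argument list.

    Defaults mirror the example invocation in the README; pass the
    command to subprocess.run(), setting CUDA_VISIBLE_DEVICES via its
    env= parameter rather than as an argument.
    """
    return [
        "python", "main.py",
        "--directory", model_dir,
        "--pre_trained_model", pretrained,
        "--train_dataset", "train.tsv",
        "--test_dataset", "test.tsv",
        "--dev_dataset", "valid.tsv",
        "--batch_size", str(batch_size),
        "--do_train",
        "--no_cpu", "5",
        "--language", language,
        "--model", model,
        "--num_layers", str(num_layers),
    ]

cmd = build_train_cmd("TEMP_MODEL")
print(shlex.join(cmd))  # shell-quoted command line (Python 3.8+)
```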

Predicting:

```shell
python main.py \
    --directory TEMP_MODEL \
    --pre_trained_model PRETRAINED_MODEL_NAME \
    --train_dataset train.tsv \
    --test_dataset test.tsv \
    --dev_dataset valid.tsv \
    --dataset_dir DIR_DATA_TEST \
    --output_dir DIR_DATA_TEST_PREDICTIONS \
    --batch_size 4 \
    --do_eval \
    --saved_model TEMP_MODEL/best/best_ \
    --no_cpu 5 \
    --language french \
    --model stacked \
    --num_layers 2
```

- `--directory`, `--pre_trained_model`, `--train_dataset`, `--test_dataset`, `--dev_dataset`, `--no_cpu`, `--language`, `--model`, and `--num_layers` must match the values used for training
- `--dataset_dir`: directory containing the `.tsv` files to be predicted
- `--output_dir`: directory where the predictions will be saved
- `--saved_model`: the best model saved during training (`TEMP_MODEL/best/best_`)

Dataset Annotation

```
TOKEN NE-COARSE-LIT NE-COARSE-METO NE-FINE-LIT NE-FINE-METO NE-FINE-COMP NE-NESTED NEL-LIT NEL-METO MISC
# language = fr
# newspaper = GDL
# date = 1878-02-22
# document_id = GDL-1878-02-22-a-i0014
# segment_iiif_link = _
LAUSANNE B-loc O B-loc.adm.town O O O Q807 _ EndOfLine
On O O O O O O _ _ _
nous O O O O O O _ _ _
prie O O O O O O _ _ _
de O O O O O O _ _ _
faire O O O O O O _ _ _
connaître O O O O O O _ _ _
le O O O O O O _ _ _
résultat O O O O O O _ _ EndOfLine
Sécuniaire O O O O O O _ _ _
des O O O O O O _ _ _
quatre O O O O O O _ _ _
conférences O O O O O O _ _ _
sur O O O O O O _ _ _
l' O O O O O O _ _ NoSpaceAfter
Orient B-loc O B-loc.adm.sup O O O Q205653 _ EndOfLine
M B-pers O B-pers.ind O B-comp.title O Q123894 _ NoSpaceAfter
. I-pers O I-pers.ind O I-comp.title O Q123894 _ _
le I-pers O I-pers.ind O O O Q123894 _ _
professeur I-pers O I-pers.ind O B-comp.function O Q123894 _ _
Gilliéron I-pers O I-pers.ind O B-comp.name O Q123894 _ NoSpaceAfter
. O O O O O O _ _ EndOfLine
```
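The annotation follows a CoNLL/HIPE-style layout: lines starting with `#` are document metadata, and each token line carries ten columns, the last (MISC) holding markers such as `EndOfLine` and `NoSpaceAfter`. A minimal reader sketch, assuming tab-separated columns; the `read_rows` helper is illustrative, not part of the repository:

```python
# Column names taken from the header row shown above.
COLUMNS = ["TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT",
           "NE-FINE-METO", "NE-FINE-COMP", "NE-NESTED",
           "NEL-LIT", "NEL-METO", "MISC"]

def read_rows(lines):
    """Yield one dict per token row, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield dict(zip(COLUMNS, fields))

# Tiny sample in the same layout as the snippet above.
sample = [
    "# language = fr",
    "LAUSANNE\tB-loc\tO\tB-loc.adm.town\tO\tO\tO\tQ807\t_\tEndOfLine",
    "On\tO\tO\tO\tO\tO\tO\t_\t_\t_",
]
rows = list(read_rows(sample))
print(rows[0]["TOKEN"], rows[0]["NE-COARSE-LIT"])  # LAUSANNE B-loc
```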

Requirements

```shell
pip install -r requirements.txt
```

How to cite:

```bibtex
@inproceedings{boros2020robust,
  title={Robust named entity recognition and linking on historical multilingual documents},
  author={Boros, Emanuela and Pontes, Elvys Linhares and Cabrera-Diego, Luis Adri{\'a}n and Hamdi, Ahmed and Moreno, Jos{\'e} and Sid{\`e}re, Nicolas and Doucet, Antoine},
  booktitle={Conference and Labs of the Evaluation Forum (CLEF 2020)},
  volume={2696},
  number={Paper 171},
  pages={1--17},
  year={2020},
  organization={CEUR-WS Working Notes}
}

@inproceedings{borocs2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boro{\c{s}}, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
  booktitle={Proceedings of the 24th Conference on Computational Natural Language Learning},
  pages={431--441},
  year={2020}
}
```