Project author: limteng-rpi

Project description:
Code for the paper "A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling" (ACL 2018)
Language: Python
Repository: git://github.com/limteng-rpi/mlmt.git
Created: 2018-04-30T22:24:45Z
Project homepage: https://github.com/limteng-rpi/mlmt


If what you want is a (monolingual, single-task) name tagging model, we have a new implementation at https://github.com/limteng-rpi/neural_name_tagging .

Old files were moved to old/.

Requirements

  • Python 3.5+
  • PyTorch 0.4.1 or PyTorch 1.0 (old scripts use PyTorch 0.3.1)
  • tqdm (used to display training progress)

Architecture

Figure: Overall multi-lingual multi-task architecture

Pre-trained word embeddings

Pre-trained word embeddings for English, Dutch, Spanish, Russian, and Chechen can be found at this page.

Update: I added English, Dutch, and Spanish case-sensitive word embeddings.
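
To sanity-check a downloaded embedding file before training, the short sketch below may help. It assumes the files follow the common word2vec text format (an optional "<vocab_size> <dim>" header line, then one token per line followed by its vector values); the file name is taken from the configuration example further down and is illustrative only.

```python
# Minimal sketch for inspecting a pre-trained embedding file, assuming
# word2vec text format (assumption; check your downloaded files).
import numpy as np

def load_word_embeddings(path):
    """Read a word2vec-style text file into a {token: vector} dict."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) == 2:  # header line: vocabulary size and dimension
                continue
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# 'enwiki.cbow.50d.txt' is the file name used in the configuration example below.
vectors = load_word_embeddings('enwiki.cbow.50d.txt')
print(len(vectors), len(next(iter(vectors.values()))))
```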

Single-task Mono-lingual Model

Train a new model:

```
python train_single.py --train <PATH> --dev <PATH> \
    --test <PATH> --log <LOG_DIRECTORY> --model <MODEL_DIRECTORY> \
    --max_epoch 50 --word_embed <PATH> \
    --word_embed_dim 100 --char_embed_dim 50
```
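
Judging from the parser settings in the configuration example below (format `conll`, `token_col` 0, `label_col` 1) and the `.bioes` file names, each input file is two-column CoNLL text with BIOES labels and blank lines between sentences. An invented illustration (tokens and tags are made up):

```
John      B-PER
Smith     E-PER
lives     O
in        O
New       B-LOC
York      E-LOC
.         O

Amsterdam S-LOC
is        O
cold      O
.         O
```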

Evaluate the trained model:

```
python eval_single.py --model <PATH/TO/THE/MODEL/FILE> --file <PATH/TO/THE/DATA/FILE> \
    --log <LOG/DIRECTORY>
```

Multi-task Model

In my original code, I use the build_tasks_from_file function in task.py to build the whole architecture from a configuration file (see the Configuration section). pipeline.py shows how to use this function.
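
A hypothetical sketch of that entry point is below; the authoritative usage is in pipeline.py, and the exact signature and return value of build_tasks_from_file live in task.py. The call shape shown here is an assumption.

```python
# Hypothetical sketch only: see pipeline.py for the real usage.
from task import build_tasks_from_file

# Assumed behavior: read the JSON configuration, instantiate the shared
# components, and wire them into one model per task.
tasks = build_tasks_from_file('example_config.json')
```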

Train a new model:

```
python train_multi.py --train_tgt <PATH> --dev_tgt <PATH> \
    --test_tgt <PATH> --train_cl <PATH> --dev_cl <PATH> \
    --test_cl <PATH> --train_ct <PATH> --dev_ct <PATH> \
    --test_ct <PATH> --train_clct <PATH> --dev_clct <PATH> \
    --test_clct <PATH> --log <LOG_DIRECTORY> \
    --model <MODEL_DIRECTORY> --max_epoch 50 \
    --word_embed_1 <PATH> --word_embed_2 <PATH> --word_embed_dim 50
```
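
Following the paper's terminology, `tgt` appears to denote the target task's data sets, while `cl`, `ct`, and `clct` appear to supply the auxiliary data sets that differ from the target in language (cross-lingual), in task (cross-task), or in both.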

Evaluate the trained model:

```
python eval_multi.py --model <PATH/TO/THE/MODEL/FILE> --file <PATH/TO/THE/DATA/FILE> \
    --log <LOG/DIRECTORY>
```
Configuration

For the complete configuration, see `example_config.json`.

```json
{
  "training": {
    "eval_freq": 1000,        # Evaluate the model every <eval_freq> global steps
    "max_step": 50000,        # Maximum training steps
    "gpu": true               # Use GPU
  },
  "datasets": [               # A list of data sets
    {
      "name": "nld_ner",      # Data set name
      "language": "nld",      # Data set language
      "type": "sequence",     # Data set type; 'sequence' is currently the only supported value
      "task": "ner",          # Task (identical to the 'task' value of the corresponding task)
      "parser": {             # Data set parser
        "format": "conll",    # File format
        "token_col": 0,       # Token column index
        "label_col": 1        # Label column index
      },
      "sample": 200,          # Sample number (optional): 'all', int, or float
      "batch_size": 19,       # Batch size
      "files": {
        "train": "/PATH/TO/ned.train.bioes",  # Path to the training set
        "dev": "/PATH/TO/ned.testa.bioes",    # Path to the dev set
        "test": "/PATH/TO/ned.testb.bioes"    # Path to the test set (optional)
      }
    },
    ...
  ],
  "tasks": [
    {
      "name": "Dutch NER",    # Task name
      "language": "nld",      # Task language
      "task": "ner",          # Task
      "model": {              # Components can be shared and are configured in
                              # 'components'. Just put their names here.
        "model": "lstm_crf",              # Model type
        "word_embed": "nld_word_embed",   # Word embedding
        "char_embed": "char_embed",       # Character embedding
        "crf": "ner_crf",                 # CRF layer
        "lstm": "lstm",                   # LSTM layer
        "univ_layer": "ner_univ_linear",  # Universal/shared linear layer
        "spec_layer": "ner_nld_linear",   # Language-specific linear layer
        "embed_dropout": 0.0,             # Embedding dropout probability
        "lstm_dropout": 0.6,              # LSTM output dropout probability
        "linear_dropout": 0.0,            # Linear layer output dropout probability
        "use_char_embedding": true,       # Use character embeddings
        "char_highway": "char_highway"    # Highway networks for character embeddings
      },
      "dataset": "nld_ner",   # Data set name
      "learning_rate": 0.02,  # Learning rate
      "decay_rate": 0.9,      # Decay rate
      "decay_step": 10000,    # Decay step
      "ref": true             # Is the target task
    },
    ...
  ],
  "components": [
    {
      "name": "eng_word_embed",
      "model": "embedding",
      "language": "eng",
      "file": "/PATH/TO/enwiki.cbow.50d.txt",
      "stats": true,
      "padding": 2,
      "trainable": true,
      "allow_gpu": false,
      "dimension": 50,
      "padding_idx": 0,
      "sparse": true
    },
    {
      "name": "nld_word_embed",
      "model": "embedding",
      "language": "nld",
      "file": "/PATH/TO/nlwiki.cbow.50d.txt",
      "stats": true,
      "padding": 2,
      "trainable": true,
      "allow_gpu": false,
      "dimension": 50,
      "padding_idx": 0,
      "sparse": true
    },
    {
      "name": "char_embed",
      "model": "char_cnn",
      "dimension": 50,
      "filters": [[2, 20], [3, 20], [4, 20]]
    },
    {
      "name": "lstm",
      "model": "lstm",
      "hidden_size": 171,
      "bidirectional": true,
      "forget_bias": 1.0,
      "batch_first": true,
      "dropout": 0.0          # Dropout only applies between LSTM layers, so this
                              # value has no effect on the 1-layer LSTM used here.
    },
    {
      "name": "ner_crf",
      "model": "crf"
    },
    {
      "name": "pos_crf",
      "model": "crf"
    },
    {
      "name": "ner_univ_linear",
      "model": "linear",
      "position": "output"
    },
    {
      "name": "ner_eng_linear",
      "model": "linear",
      "position": "output"
    },
    {
      "name": "ner_nld_linear",
      "model": "linear",
      "position": "output"
    },
    {
      "name": "pos_univ_linear",
      "model": "linear",
      "position": "output"
    },
    {
      "name": "pos_eng_linear",
      "model": "linear",
      "position": "output"
    },
    {
      "name": "pos_nld_linear",
      "model": "linear",
      "position": "output"
    },
    {
      "name": "char_highway",
      "model": "highway",
      "position": "char",
      "num_layers": 2,
      "activation": "selu"
    }
  ]
}
```
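
Note that the `#` annotations above are explanatory and not valid JSON, so a plain `json.load` would reject a file that keeps them. If your copy of the config carries such comments, strip them before parsing; a minimal sketch with a hypothetical helper, assuming `#` never appears inside a string value such as a file path:

```python
import json
import re

def load_annotated_json(path):
    """Hypothetical helper: parse JSON that carries trailing '#' comments.

    Assumes '#' never occurs inside a string value; everything from a '#'
    to the end of its line is dropped before parsing.
    """
    with open(path, encoding='utf-8') as f:
        text = re.sub(r'#.*', '', f.read())
    return json.loads(text)

config = load_annotated_json('example_config.json')
print(config['training']['max_step'])  # -> 50000 with the values shown above
```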

Reference

  • Lin, Y., Yang, S., Stoyanov, V., and Ji, H. (2018). A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. [pdf]
```
@inproceedings{ying2018multi,
  title     = {A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling},
  author    = {Ying Lin and Shengqi Yang and Veselin Stoyanov and Heng Ji},
  booktitle = {Proceedings of The 56th Annual Meeting of the Association for Computational Linguistics (ACL2018)},
  year      = {2018}
}
```