Project author: dsindex

Project description:
reference code for syntaxnet
Language: Python
Project URL: git://github.com/dsindex/syntaxnet.git
Created: 2016-05-17T00:43:21Z
Project community: https://github.com/dsindex/syntaxnet

License:


syntaxnet

description

  • test code for syntaxnet
    • train and test a model using the UD corpus.
    • train and test a Korean parser model using the Sejong corpus.
    • export a trained model and serve it (limited to a designated, older version of syntaxnet).
    • train and test a model using dragnn.
    • comparison to the bist-parser.

history

  • 2017. 3. 27

    • test for dragnn
    • version
      1. python : 2.7
      2. bazel : 0.4.3
      3. protobuf : 3.2.0
      4. syntaxnet : 40a5739ae26baf6bfa352d2dec85f5ca190254f8
  • 2017. 3. 10

    • updated for a recent version of syntaxnet (tf 1.0), OS X (bash scripts), and universal treebank v2.0
    • version
      1. python : 2.7
      2. bazel : 0.4.3
      3. protobuf : 3.0.0b2, 3.2.0
      4. syntaxnet : bc70271a51fe2e051b5d06edc6b9fd94880761d5
  • 2016. 8. 16

    • add 'char-map' to 'context.pbtxt' for training
    • add '--resource_dir' for testing
      • if you installed an old version of syntaxnet (e.g., a4b7bb9a5dd2c021edcd3d68d326255c734d0ef0), you should specify the path to each file in 'context.pbtxt'
    • version
      1. syntaxnet : a5d45f2ed20effaabc213a2eb9def291354af1ec

how to test

  1. # after installing syntaxnet.
  2. # gpu support : https://github.com/tensorflow/models/issues/248#issuecomment-288991859
  3. $ pwd
  4. /path/to/models/syntaxnet
  5. $ git clone https://github.com/dsindex/syntaxnet.git work
  6. $ cd work
  7. $ echo "hello syntaxnet" | ./demo.sh
  8. # training parser only with parsed corpus
  9. $ ./parser_trainer_test.sh
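If you prefer to drive the demo from Python rather than the shell, the pipeline above can be wrapped with subprocess; a minimal sketch (an illustration, not part of this repository):

```python
# minimal sketch: pipe a sentence through demo.sh and capture the
# dependency tree it prints. run from the `work` directory cloned above.
import subprocess

def parse(sentence):
    p = subprocess.Popen(['./demo.sh'],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE)
    out, _ = p.communicate(sentence + '\n')
    return out

print(parse('hello syntaxnet'))
```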

universal dependency corpus

training tagger and parser with another corpus

  1. # for example, training UD_English.
  2. # detail instructions can be found in https://github.com/tensorflow/models/tree/master/syntaxnet
  3. $ ./train.sh -v -v
  4. ...
  5. #preprocessing with tagger
  6. INFO:tensorflow:Seconds elapsed in evaluation: 9.77, eval metric: 99.71%
  7. INFO:tensorflow:Seconds elapsed in evaluation: 1.26, eval metric: 92.04%
  8. INFO:tensorflow:Seconds elapsed in evaluation: 1.26, eval metric: 92.07%
  9. ...
  10. #pretrain parser
  11. INFO:tensorflow:Seconds elapsed in evaluation: 4.97, eval metric: 82.20%
  12. ...
  13. #evaluate pretrained parser
  14. INFO:tensorflow:Seconds elapsed in evaluation: 44.30, eval metric: 92.36%
  15. INFO:tensorflow:Seconds elapsed in evaluation: 5.42, eval metric: 82.67%
  16. INFO:tensorflow:Seconds elapsed in evaluation: 5.59, eval metric: 82.36%
  17. ...
  18. #train parser
  19. INFO:tensorflow:Seconds elapsed in evaluation: 57.69, eval metric: 83.95%
  20. ...
  21. #evaluate parser
  22. INFO:tensorflow:Seconds elapsed in evaluation: 283.77, eval metric: 96.54%
  23. INFO:tensorflow:Seconds elapsed in evaluation: 34.49, eval metric: 84.09%
  24. INFO:tensorflow:Seconds elapsed in evaluation: 34.97, eval metric: 83.49%
  25. ...

training parser only

  1. # if you have another pos-tagger and want to build only a parser from the parsed corpus :
  2. $ ./train_p.sh -v -v
  3. ...
  4. #pretrain parser
  5. ...
  6. #evaluate pretrained parser
  7. INFO:tensorflow:Seconds elapsed in evaluation: 44.15, eval metric: 92.21%
  8. INFO:tensorflow:Seconds elapsed in evaluation: 5.56, eval metric: 87.84%
  9. INFO:tensorflow:Seconds elapsed in evaluation: 5.43, eval metric: 86.56%
  10. ...
  11. #train parser
  12. ...
  13. #evaluate parser
  14. INFO:tensorflow:Seconds elapsed in evaluation: 279.04, eval metric: 94.60%
  15. INFO:tensorflow:Seconds elapsed in evaluation: 33.19, eval metric: 88.60%
  16. INFO:tensorflow:Seconds elapsed in evaluation: 32.57, eval metric: 87.77%
  17. ...

test new model

  1. $ echo "this is my own tagger and parser" | ./test.sh
  2. ...
  3. Input: this is my own tagger and parser
  4. Parse:
  5. tagger NN ROOT
  6. +-- this DT nsubj
  7. +-- is VBZ cop
  8. +-- my PRP$ nmod:poss
  9. +-- own JJ amod
  10. +-- and CC cc
  11. +-- parser NN conj
  12. # original model
  13. $ echo "this is my own tagger and parser" | ./demo.sh
  14. Input: this is my own tagger and parser
  15. Parse:
  16. tagger NN ROOT
  17. +-- this DT nsubj
  18. +-- is VBZ cop
  19. +-- my PRP$ poss
  20. +-- own JJ amod
  21. +-- and CC cc
  22. +-- parser ADD conj
  23. $ echo "Bob brought the pizza to Alice ." | ./test.sh
  24. Input: Bob brought the pizza to Alice .
  25. Parse:
  26. brought VBD ROOT
  27. +-- Bob NNP nsubj
  28. +-- pizza NN dobj
  29. | +-- the DT det
  30. +-- Alice NNP nmod
  31. | +-- to IN case
  32. +-- . . punct
  33. # original model
  34. $ echo "Bob brought the pizza to Alice ." | ./demo.sh
  35. Input: Bob brought the pizza to Alice .
  36. Parse:
  37. brought VBD ROOT
  38. +-- Bob NNP nsubj
  39. +-- pizza NN dobj
  40. | +-- the DT det
  41. +-- to IN prep
  42. | +-- Alice NNP pobj
  43. +-- . . punct

training parser from Sejong treebank corpus

  1. # the corpus is accessible through the path on this image : https://raw.githubusercontent.com/dsindex/blog/master/images/url_sejong.png
  2. # copy sejong_treebank.txt.v1 to `sejong` directory.
  3. $ ./sejong/split.sh
  4. $ ./sejong/c2d.sh
  5. $ ./train_sejong.sh
  6. #pretrain parser
  7. ...
  8. INFO:tensorflow:Seconds elapsed in evaluation: 14.18, eval metric: 93.43%
  9. ...
  10. #evaluate pretrained parser
  11. INFO:tensorflow:Seconds elapsed in evaluation: 116.08, eval metric: 95.11%
  12. INFO:tensorflow:Seconds elapsed in evaluation: 14.60, eval metric: 93.76%
  13. INFO:tensorflow:Seconds elapsed in evaluation: 14.45, eval metric: 93.78%
  14. ...
  15. #evaluate pretrained parser by eoj-based
  16. accuracy(UAS) = 0.903289
  17. accuracy(UAS) = 0.876198
  18. accuracy(UAS) = 0.876888
  19. ...
  20. #train parser
  21. INFO:tensorflow:Seconds elapsed in evaluation: 137.36, eval metric: 94.12%
  22. ...
  23. #evaluate parser
  24. INFO:tensorflow:Seconds elapsed in evaluation: 1806.21, eval metric: 96.37%
  25. INFO:tensorflow:Seconds elapsed in evaluation: 224.40, eval metric: 94.19%
  26. INFO:tensorflow:Seconds elapsed in evaluation: 223.75, eval metric: 94.25%
  27. ...
  28. #evaluate parser by eoj-based
  29. accuracy(UAS) = 0.928845
  30. accuracy(UAS) = 0.886139
  31. accuracy(UAS) = 0.887824
  32. ...

test korean parser model

  1. $ cat sejong/tagged_input.sample
  2. 1 프랑스 프랑스 NNP NNP _ 0 _ _ _
  3. 2 JKG JKG _ 0 _ _ _
  4. 3 세계 세계 NNG NNG _ 0 _ _ _
  5. 4 XSN XSN _ 0 _ _ _
  6. 5 VCP VCP _ 0 _ _ _
  7. 6 ETM ETM _ 0 _ _ _
  8. 7 의상 의상 NNG NNG _ 0 _ _ _
  9. 8 디자이너 디자이너 NNG NNG _ 0 _ _ _
  10. 9 엠마누엘 엠마누엘 NNP NNP _ 0 _ _ _
  11. 10 웅가로 웅가로 NNP NNP _ 0 _ _ _
  12. 11 JKS JKS _ 0 _ _ _
  13. 12 실내 실내 NNG NNG _ 0 _ _ _
  14. 13 장식 장식 NNG NNG _ 0 _ _ _
  15. 14 XSN XSN _ 0 _ _ _
  16. 15 직물 직물 NNG NNG _ 0 _ _ _
  17. 16 디자이너 디자이너 NNG NNG _ 0 _ _ _
  18. 17 JKB JKB _ 0 _ _ _
  19. 18 나서 나서 VV VV _ 0 _ _ _
  20. 19 EP EP _ 0 _ _ _
  21. 20 EF EF _ 0 _ _ _
  22. 21 . . SF SF _ 0 _ _ _
  23. $ cat sejong/tagged_input.sample | ./test_sejong.sh -v -v
  24. Input: 프랑스 세계 의상 디자이너 엠마누엘 웅가로 실내 장식 직물 디자이너 나서 .
  25. Parse:
  26. . SF ROOT
  27. +-- EF MOD
  28. +-- EP MOD
  29. +-- 나서 VV MOD
  30. +-- JKS NP_SBJ
  31. | +-- 웅가로 NNP MOD
  32. | +-- 디자이너 NNG NP
  33. | | +-- JKG NP_MOD
  34. | | | +-- 프랑스 NNP MOD
  35. | | +-- ETM VNP_MOD
  36. | | | +-- VCP MOD
  37. | | | +-- XSN MOD
  38. | | | +-- 세계 NNG MOD
  39. | | +-- 의상 NNG NP
  40. | +-- 엠마누엘 NNP NP
  41. +-- JKB NP_AJT
  42. +-- 디자이너 NNG MOD
  43. +-- 직물 NNG NP
  44. +-- 실내 NNG NP
  45. +-- XSN NP
  46. +-- 장식 NNG MOD

apply Korean POS tagger (Komoran via konlpy)

  1. # after installing konlpy ( http://konlpy.org/ko/v0.4.3/ )
  2. $ python sejong/tagger.py
  3. 나는 학교에 간다.
  4. 1 나 나 NP NP _ 0 _ _ _
  5. 2 는 는 JX JX _ 0 _ _ _
  6. 3 학교 학교 NNG NNG _ 0 _ _ _
  7. 4 에 에 JKB JKB _ 0 _ _ _
  8. 5 가 가 VV VV _ 0 _ _ _
  9. 6 ㄴ다 ㄴ다 EF EF _ 0 _ _ _
  10. 7 . . SF SF _ 0 _ _ _
  11. $ echo "나는 학교에 간다." | python sejong/tagger.py | ./test_sejong.sh
  12. Input: 나 는 학교 에 가 ㄴ다 .
  13. Parse:
  14. . SF ROOT
  15. +-- ㄴ다 EF MOD
  16. +-- 가 VV MOD
  17. +-- 는 JX NP_SBJ
  18. | +-- 나 NP MOD
  19. +-- 에 JKB NP_AJT
  20. +-- 학교 NNG MOD
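sejong/tagger.py essentially maps Komoran morpheme/tag pairs onto the 10-column rows shown above; a minimal sketch of that conversion (an illustration of the idea, not the actual script):

```python
# -*- coding: utf-8 -*-
# minimal sketch: tag a sentence with Komoran (via konlpy) and emit
# rows in the 10-column format consumed by test_sejong.sh
# (index, form, lemma, coarse tag, fine tag, feats, head, deprel, deps, misc).
from konlpy.tag import Komoran

def to_rows(sentence):
    komoran = Komoran()
    rows = []
    for i, (morph, tag) in enumerate(komoran.pos(sentence), start=1):
        rows.append('\t'.join([str(i), morph, morph, tag, tag,
                               '_', '0', '_', '_', '_']))
    return '\n'.join(rows)

print(to_rows(u'나는 학교에 간다.'))
```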

tensorflow serving and syntaxnet

  • using tensorflow serving
  • my summary
    1. $ bazel-bin/tensorflow_serving/example/parsey_client --server=localhost:9000
    2. 나는 학교에 간다
    3. Input : 나는 학교에 간다
    4. Parsing :
    5. {"result": [{"text": "나 는 학교 에 가 ㄴ다", "token": [{"category": "NP", "head": 1, "end": 2, "label": "MOD", "start": 0, "tag": "NP", "word": "나"}, {"category": "JX", "head": 4, "end": 6, "label": "NP_SBJ", "start": 4, "tag": "JX", "word": "는"}, {"category": "NNG", "head": 3, "end": 13, "label": "MOD", "start": 8, "tag": "NNG", "word": "학교"}, {"category": "JKB", "head": 4, "end": 17, "label": "NP_AJT", "start": 15, "tag": "JKB", "word": "에"}, {"category": "VV", "head": 5, "end": 21, "label": "MOD", "start": 19, "tag": "VV", "word": "가"}, {"category": "EC", "end": 28, "label": "ROOT", "start": 23, "tag": "EC", "word": "ㄴ다"}], "docid": "-:0"}]}
    6. ...
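The JSON returned by parsey_client is easy to post-process; a minimal sketch that flattens a response like the one above into (word, tag, head, label) tuples (`response.json` is a hypothetical file holding such a response):

```python
# -*- coding: utf-8 -*-
# minimal sketch: flatten the parsey_client JSON shown above into
# (word, tag, head, label) tuples. the root token has no "head" field.
import json

def flatten(response_text):
    result = json.loads(response_text)
    tuples = []
    for doc in result['result']:
        for token in doc['token']:
            tuples.append((token['word'], token['tag'],
                           token.get('head', -1), token['label']))
    return tuples

for word, tag, head, label in flatten(open('response.json').read()):
    print('%s\t%s\t%d\t%s' % (word, tag, head, label))
```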

parsey’s cousins

for English

$ echo "Bob brought the pizza to Alice." | ./parse.sh

tokenizing

Bob brought the pizza to Alice .

morphological analysis

1 Bob Number=Sing|fPOS=PROPN++NNP 0
2 brought Mood=Ind|Tense=Past|VerbForm=Fin|fPOS=VERB++VBD 0
3 the Definite=Def|PronType=Art|fPOS=DET++DT 0
4 pizza Number=Sing|fPOS=NOUN++NN 0
5 to fPOS=ADP++IN 0
6 Alice Number=Sing|fPOS=PROPN++NNP 0
7 . fPOS=PUNCT++. 0

tagging

1 Bob PROPN NNP Number=Sing|fPOS=PROPN++NNP 0
2 brought VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin|fPOS=VERB++VBD 0
3 the DET DT Definite=Def|PronType=Art|fPOS=DET++DT 0
4 pizza NOUN NN Number=Sing|fPOS=NOUN++NN 0
5 to ADP IN fPOS=ADP++IN 0
6 Alice PROPN NNP Number=Sing|fPOS=PROPN++NNP 0
7 . PUNCT . fPOS=PUNCT++. 0

parsing

1 Bob PROPN NNP Number=Sing|fPOS=PROPN++NNP 2 nsubj
2 brought VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin|fPOS=VERB++VBD 0 ROOT
3 the DET DT Definite=Def|PronType=Art|fPOS=DET++DT 4 det
4 pizza NOUN NN Number=Sing|fPOS=NOUN++NN 2 dobj
5 to ADP IN fPOS=ADP++IN 6 case
6 Alice PROPN NNP Number=Sing|fPOS=PROPN++NNP 2 nmod
7 . PUNCT . fPOS=PUNCT++. 2 punct _

conll2tree

Input: Bob brought the pizza to Alice .
Parse:
brought VERB++VBD ROOT
+-- Bob PROPN++NNP nsubj
+-- pizza NOUN++NN dobj
| +-- the DET++DT det
+-- Alice PROPN++NNP nmod
| +-- to ADP++IN case
+-- . PUNCT++. punct
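To consume the 7-column `parsing` output above programmatically instead of going through conll2tree, a minimal sketch (an illustration, not the bundled conll2tree tool):

```python
# minimal sketch: read the 7-column rows printed under `parsing`
# (id, form, upos, xpos, feats, head, deprel) from stdin and print
# an indented dependency tree.
import sys
from collections import defaultdict

children = defaultdict(list)
for line in sys.stdin:
    cols = line.split()
    if len(cols) >= 7:
        idx, form, xpos, head, deprel = int(cols[0]), cols[1], cols[3], int(cols[5]), cols[6]
        children[head].append((idx, form, xpos, deprel))

def show(head, depth):
    for idx, form, xpos, deprel in children[head]:
        print('%s+-- %s %s %s' % ('    ' * depth, form, xpos, deprel))
        show(idx, depth + 1)

show(0, 0)  # head 0 is the ROOT
```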

  • downloaded model vs trained model

    1. downloaded model
       Language No. tokens POS fPOS Morph UAS LAS
       -------------------------------------------------------
       English 25096 90.48% 89.71% 91.30% 84.79% 80.38%
    2. trained model
       INFO:tensorflow:Total processed documents: 2077
       INFO:tensorflow:num correct tokens: 18634
       INFO:tensorflow:total tokens: 22395
       INFO:tensorflow:Seconds elapsed in evaluation: 19.85, eval metric: 83.21%
    3. where does the difference (84.79% vs 83.21%) come from?
       as mentioned in https://research.googleblog.com/2016/08/meet-parseys-cousins-syntax-for-40.html ,
       they found good hyperparameters by using MapReduce.
       for example, the hyperparameters for the POS tagger:
       - POS_PARAMS=128-0.08-3600-0.9-0
       - decay_steps=3600
       - hidden_layer_sizes=128
       - learning_rate=0.08
       - momentum=0.9
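The packed POS_PARAMS string encodes those values in a fixed order; a minimal sketch of how it could be unpacked (treating the trailing field as a seed is an assumption):

```python
# minimal sketch: unpack a packed parameter string such as
# POS_PARAMS=128-0.08-3600-0.9-0 into named hyperparameters.
# field order follows the values listed above; the trailing field
# is assumed to be a random seed.
def unpack_params(packed):
    hidden, lr, decay, momentum, seed = packed.split('-')
    return {
        'hidden_layer_sizes': hidden,        # 128
        'learning_rate': float(lr),          # 0.08
        'decay_steps': int(decay),           # 3600
        'momentum': float(momentum),         # 0.9
        'seed': int(seed),                   # 0
    }

print(unpack_params('128-0.08-3600-0.9-0'))
```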

dragnn

  • how to compile examples
    1. $ cd ../
    2. $ pwd
    3. /path/to/models/syntaxnet
    4. $ bazel build -c opt //examples/dragnn:tutorial_1
  • training tagger and parser with CoNLL corpus
    1. # compile
    2. $ pwd
    3. /path/to/models/syntaxnet
    4. $ bazel build -c opt //work/dragnn_examples:write_master_spec
    5. $ bazel build -c opt //work/dragnn_examples:train_dragnn
    6. $ bazel build -c opt //work/dragnn_examples:inference_dragnn
    7. # training
    8. $ cd work
    9. $ ./train_dragnn.sh -v -v
    10. ...
    11. INFO:tensorflow:training step: 25300, actual: 25300
    12. INFO:tensorflow:training step: 25400, actual: 25400
    13. INFO:tensorflow:finished step: 25400, actual: 25400
    14. INFO:tensorflow:Annotating datset: 2002 examples
    15. INFO:tensorflow:Done. Produced 2002 annotations
    16. INFO:tensorflow:Total num documents: 2002
    17. INFO:tensorflow:Total num tokens: 25148
    18. INFO:tensorflow:POS: 85.63%
    19. INFO:tensorflow:UAS: 79.67%
    20. INFO:tensorflow:LAS: 74.36%
    21. ...
    22. # test
    23. $ echo "i love this one" | ./test_dragnn.sh
    24. Input: i love this one
    25. Parse:
    26. love VBP root
    27. +-- i PRP nsubj
    28. +-- one CD obj
    29. +-- this DT det
  • training parser with Sejong corpus
    1. # compile
    2. $ pwd
    3. /path/to/models/syntaxnet
    4. $ bazel build -c opt //work/dragnn_examples:write_master_spec
    5. $ bazel build -c opt //work/dragnn_examples:train_dragnn
    6. $ bazel build -c opt //work/dragnn_examples:inference_dragnn_sejong
    7. # training
    8. $ cd work
    9. # to prepare corpus, please refer to `training parser from Sejong treebank corpus` section.
    10. $ ./train_dragnn_sejong.sh -v -v
    11. ...
    12. INFO:tensorflow:training step: 33100, actual: 33100
    13. INFO:tensorflow:training step: 33200, actual: 33200
    14. INFO:tensorflow:finished step: 33200, actual: 33200
    15. INFO:tensorflow:Annotating datset: 4114 examples
    16. INFO:tensorflow:Done. Produced 4114 annotations
    17. INFO:tensorflow:Total num documents: 4114
    18. INFO:tensorflow:Total num tokens: 97002
    19. INFO:tensorflow:POS: 93.95%
    20. INFO:tensorflow:UAS: 91.38%
    21. INFO:tensorflow:LAS: 87.76%
    22. ...
    23. # test
    24. # after installing konlpy ( http://konlpy.org/ko/v0.4.3/ )
    25. $ echo "제주로 가는 비행기가 심한 비바람에 회항했다." | ./test_dragnn_sejong.sh
    26. INFO:tensorflow:Read 1 documents
    27. Input: 제주 비행기 심하 비바람 회항 .
    28. Parse:
    29. . SF VP
    30. +-- EF MOD
    31. +-- EP MOD
    32. +-- XSA MOD
    33. +-- 회항 SN MOD
    34. +-- JKS NP_SBJ
    35. | +-- 비행기 NNG MOD
    36. | +-- ETM VP_MOD
    37. | +-- VV MOD
    38. | +-- JKB NP_AJT
    39. | +-- 제주 MAG MOD
    40. +-- JKB NP_AJT
    41. +-- 비바람 NNG MOD
    42. +-- SN MOD
    43. +-- 심하 VV NP
    44. # it seems that pos tagging results from the dragnn are somewhat incorrect.
    45. # so, i replaced those with the results from the Komoran tagger.
    46. # you can modify 'inference_dragnn_sejong.py' to use the tags from the dragnn.
    47. Input: 제주 비행기 심하 비바람 회항 .
    48. Parse:
    49. . SF VP
    50. +-- EF MOD
    51. +-- EP MOD
    52. +-- XSV MOD
    53. +-- 회항 NNG MOD
    54. +-- JKS NP_SBJ
    55. | +-- 비행기 NNG MOD
    56. | +-- ETM VP_MOD
    57. | +-- VV MOD
    58. | +-- JKB NP_AJT
    59. | +-- 제주 NNG MOD
    60. +-- JKB NP_AJT
    61. +-- 비바람 NNG MOD
    62. +-- ETM MOD
    63. +-- 심하 VA NP
  • web api using tornado

    • how to run

      # compile
      $ pwd
      /path/to/models/syntaxnet
      $ bazel build -c opt //work/dragnn_examples:dragnn_dm

      # start tornado web api
      $ cd work/dragnn_examples/www
      # start a single process
      $ ./start.sh -v -v 0 0
      # although tornado supports multi-processing, a tensorflow session is not fork-safe,
      # so do not use the multi-processing option.
      # if you want to link to the model trained on the Sejong corpus, just edit env.sh:
      #   enable_konlpy='True'

      http://hostip:8897
      http://hostip:8897/dragnn?q=i love it
      http://hostip:8897/dragnn?q=나는 학교에 가서 공부했다.

    • view (sample)
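For programmatic access, a minimal client sketch against the endpoints above (python 2.7, standard library only; the response is assumed to be JSON, and the exact schema may differ from the parsey_client output shown earlier):

```python
# minimal sketch: query the tornado web api started above and
# pretty-print the response, which is assumed to be JSON.
import json
import urllib
import urllib2

host = 'http://hostip:8897'              # replace with your server address
url = '%s/dragnn?q=%s' % (host, urllib.quote('i love it'))
response = urllib2.urlopen(url).read()
print(json.dumps(json.loads(response), indent=2))
```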

brat annotation tool

comparison to BIST parser