Author: pprzetacznik

Description:
Natural Language Processing - n-grams statistics
Language: Python
Repository: git://github.com/pprzetacznik/nlp-n-grams.git
Created: 2015-03-18T00:24:30Z
Project page: https://github.com/pprzetacznik/nlp-n-grams


Natural Language Processing - n-grams statistics

Run:

    $ pip install -r requirements.txt
    $ python -m ngrams --train-dir train_corpus --n-gram 3 --test-file test_corpus/thewire.txt
    Processing file: train_corpus/2momm10.txt
    Processing file: train_corpus/4momm10.txt
    Processing file: train_corpus/54.txt
    Processing file: train_corpus/5momm10.txt
    Processing file: train_corpus/8momm10.txt
    Processing file: train_corpus/finnish.txt
    Processing file: train_corpus/finnish1.txt
    Processing file: train_corpus/Harry Potter 1 Sorcerer's_Stone.txt
    Processing file: train_corpus/Harry Potter 2 Chamber_of_Secrets.txt
    Processing file: train_corpus/Harry Potter 3 Prisoner of Azkaban.txt
    Processing file: train_corpus/Harry Potter 4 and the Goblet of Fire.txt
    Processing file: train_corpus/polski.txt
    Processing file: train_corpus/polski2.txt
    Processing file: train_corpus/polski3.txt
    Processing file: train_corpus/q.txt
    Processing file: train_corpus/spanish.txt
    Processing file: train_corpus/spanish1.txt
    ######## RECOMMENDATIONS ###########
    Harry Potter 1 Sorcerer's_Stone.txt : 0.8516218202150778
    Harry Potter 4 and the Goblet of Fire.txt : 0.8382954018753523
    Harry Potter 3 Prisoner of Azkaban.txt : 0.8297737283348564
    Harry Potter 2 Chamber_of_Secrets.txt : 0.8286664546220112
    q.txt : 0.2687367480112696
    54.txt : 0.2654665173972638
    spanish1.txt : 0.2157763828702334
    8momm10.txt : 0.20948116826277266
    5momm10.txt : 0.20088947556841405
    4momm10.txt : 0.20052011673047865
    2momm10.txt : 0.1900199333318412
    spanish.txt : 0.18425240798457684
    finnish.txt : 0.1749251467750644
    finnish1.txt : 0.17453271188273536
    polski.txt : 0.12179996712775056
    polski3.txt : 0.09870171027877826
    polski2.txt : 0.09401455065003979
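
The README does not say how the recommendation scores are computed. As a rough illustration only, the sketch below shows one common way to produce a ranking of this shape: build an n-gram count profile for each training file and compare it to the test file's profile with cosine similarity. The function names (`char_ngrams`, `cosine_similarity`, `rank_documents`) are hypothetical and not taken from the `ngrams` package, and the choice of character n-grams and cosine similarity is an assumption. The real implementation may use word n-grams or a different measure.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text (assumed profile)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    # Counter returns 0 for missing keys, so only shared n-grams contribute.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(train_texts, test_text, n=3):
    """Return (name, score) pairs sorted from most to least similar."""
    test_profile = char_ngrams(test_text, n)
    scores = {
        name: cosine_similarity(char_ngrams(text, n), test_profile)
        for name, text in train_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under these assumptions, `rank_documents({"a.txt": text_a, "b.txt": text_b}, test_text)` yields a list ordered like the RECOMMENDATIONS block above, with texts in the same language and style as the test file scoring closest to 1.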

Tests

    $ pytest