Project author: iconclub

Project description:
Python implementation of GloVe
Primary language: C
Project URL: git://github.com/iconclub/python-glove.git
Created: 2021-07-26T13:07:53Z
Project community: https://github.com/iconclub/python-glove

License: Apache License 2.0


python-glove

Python implementation of GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-occurrence statistics from a corpus,
and the resulting representations showcase interesting linear substructures of the word vector space.
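
To make "aggregated global word-word co-occurrence statistics" concrete, here is a minimal sketch of how such counts can be gathered with a symmetric context window, together with the weighting function f(x) from the GloVe paper. The function names `count_cooccurrences` and `glove_weight` are hypothetical and are not part of this package's API. The 1/distance weighting follows the default of the reference C implementation; whether this package does exactly the same is an assumption.

```python
from collections import defaultdict


def count_cooccurrences(sentences, window=5):
    """Accumulate symmetric, distance-weighted co-occurrence counts.

    Hypothetical helper for illustration only; not this package's API.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer pairs contribute more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight  # symmetric context
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from the GloVe paper.

    Down-weights rare pairs and caps the influence of frequent ones at 1.
    """
    return (x / x_max) ** alpha if x < x_max else 1.0


# Toy corpus, already tokenized.
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = count_cooccurrences(sentences, window=2)

print(counts[("the", "cat")])                # adjacent in both sentences
print(counts[("the", "sat")])                # two positions apart, once
print(glove_weight(counts[("the", "cat")]))  # weight applied during training
```

The defaults `x_max=100.0` and `alpha=0.75` match the values printed in the training log under Usage.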

Installation

From source:

  $ git clone https://github.com/iconclub/python-glove
  $ cd python-glove
  $ python setup.py install

Usage

  >>> from glove import Glove
  >>> model = Glove(corpus_file='test/corpus.txt', vector_size=100, window=5, min_count=5, epochs=10, verbose=True)
  BUILDING VOCABULARY
  Processed 3381866 tokens.
  Counted 24746 unique words.
  Truncating vocabulary at min count 5.
  Using vocabulary of size 7731.
  COUNTING COOCCURRENCES
  window size: 5
  context: symmetric
  max product: 10485784
  overflow length: 28521267
  Reading vocab from file "/Users/hieunguyen/Desktop/ICON/python-glove/glove/.tmp/vocab.txt"...loaded 7731 words.
  Building lookup table...table contains 28729425 elements.
  Processed 3381866 tokens.
  Writing cooccurrences to disk.......2 files in total.
  Merging cooccurrence files: processed 2680220 lines.
  Using random seed 42
  SHUFFLING COOCCURRENCES
  array size: 127506841
  Shuffling by chunks: processed 2680220 lines.
  Wrote 1 temporary file(s).
  Merging temp files: processed 2680220 lines.
  TRAINING MODEL
  Read 2680220 lines.
  Initializing parameters...Using random seed 42
  done.
  vector size: 100
  vocab size: 7731
  x_max: 100.000000
  alpha: 0.750000
  07/27/21 - 10:33.06AM, iter: 001, cost: 0.061383
  07/27/21 - 10:33.08AM, iter: 002, cost: 0.044241
  07/27/21 - 10:33.11AM, iter: 003, cost: 0.039158
  07/27/21 - 10:33.13AM, iter: 004, cost: 0.036379
  07/27/21 - 10:33.15AM, iter: 005, cost: 0.033546
  07/27/21 - 10:33.19AM, iter: 006, cost: 0.030526
  07/27/21 - 10:33.21AM, iter: 007, cost: 0.027430
  07/27/21 - 10:33.23AM, iter: 008, cost: 0.024350
  07/27/21 - 10:33.25AM, iter: 009, cost: 0.021456
  07/27/21 - 10:33.27AM, iter: 010, cost: 0.019001
  >>> print(model.wv.vectors.shape)
  (7732, 100)
  >>> print(model.wv.vectors)
  [[-0.7832 0.230984 0.328523 ... -0.938997 -0.772137 0.827372]
   [ 0.119143 0.06323 0.773245 ... -0.802186 -1.225709 0.65204 ]
   [-0.382861 -0.607985 0.218486 ... -0.402255 -1.133209 0.395143]
   ...
   [-0.026736 0.005838 -0.052565 ... 0.016259 0.022208 -0.015785]
   [-0.017614 0.020005 -0.055972 ... 0.024249 0.039124 -0.055554]
   [-0.012019 0.008404 -0.034215 ... 0.026566 0.037037 -0.031336]]
  >>> print(model.wv.index_to_key[:10])
  [',', '.', 'là', 'tôi', 'một', 'có', 'và', 'những', 'chúng', 'của']
  >>> print(len(model.wv))
  7732
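
Once vectors are trained, a common next step is to look up a word's nearest neighbours by cosine similarity. The sketch below is one way this could be done. It uses toy data and plain Python so that it runs without extra dependencies. The helpers `cosine` and `most_similar` are hypothetical and not part of this package, and the code assumes every vector is non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, keys, vectors, topn=3):
    """Return the `topn` (word, similarity) pairs closest to `word`.

    `keys[i]` is assumed to be the word whose vector is `vectors[i]`.
    """
    query = vectors[keys.index(word)]
    scored = [
        (other, cosine(query, vec))
        for other, vec in zip(keys, vectors)
        if other != word
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy stand-ins for a trained vocabulary and its vectors.
keys = ["king", "queen", "apple"]
vectors = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.1, 1.0],
]

neighbours = most_similar("king", keys, vectors, topn=2)
for other, score in neighbours:
    print(f"{other}\t{score:.4f}")
```

With a trained model, the same helper could presumably be called as `most_similar('tôi', model.wv.index_to_key, model.wv.vectors)`, assuming the two attributes shown in the session above are index-aligned. For a vocabulary of this size, a vectorized NumPy version would be considerably faster.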

Development

Pull requests are welcome.
Happy hacking!