Project author: zgornel

Project description: Julia interface to GloVe
Language: Julia
Repository: git://github.com/zgornel/Glowe.jl.git
Created: 2018-12-09T18:15:29Z
Project page: https://github.com/zgornel/Glowe.jl

License: MIT License


Glowe

Julia interface to GloVe.

This package provides functionality for generating and working with GloVe word embeddings. The training is done using the original C code from the GloVe github repository.

Note that there is also a package called Glove.jl that provides a pure Julia implementation of the algorithm.

Installation

    using Pkg
    Pkg.add(PackageSpec(url="https://github.com/zgornel/Glowe.jl", rev="master"))

for the latest master or

    Pkg.add("Glowe")

for the stable versions.

Documentation

Most of the documentation is provided in Julia’s native docsystem.

Examples

Following the Word2Vec.jl example, download the corpus from http://mattmahoney.net/dc/text8.zip and extract it as the text file text8 in the current working directory. The GloVe model can then be trained with:

    using Glowe

    # Training (may take a while)
    vocab_count("text8", "vocab.txt", min_count=5, verbose=1);
    cooccur("text8", "vocab.txt", "cooccurrence.bin", memory=8.0, verbose=1);
    shuffle("cooccurrence.bin", "cooccurrence.shuf.bin", memory=8.0, verbose=1);
    glove("cooccurrence.shuf.bin", "vocab.txt", "text8-vec", threads=8,
          x_max=10.0, iter=15, vector_size=300, binary=0, write_header=1,
          verbose=1);
    # BUILDING VOCABULARY
    # Truncating vocabulary at min count 5.
    # Using vocabulary of size 71290.
    #
    # COUNTING COOCCURRENCES
    # window size: 15
    # context: symmetric
    # Merging cooccurrence files: processed 60666468 lines.
    #
    # SHUFFLING COOCCURRENCES
    # array size: 510027366
    # Merging temp files: processed 60666468 lines.
    #
    # TRAINING MODEL
    # Read 60666468 lines.
    # vector size: 300
    # vocab size: 71290
    # x_max: 10.000000
    # alpha: 0.750000
    # 12/11/18 - 12:58.58AM, iter: 001, cost: 0.070201
    # 12/11/18 - 01:00.33AM, iter: 002, cost: 0.052521
    # ...
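
The x_max and alpha values printed in the training log are the parameters of the weighting function described in the GloVe paper: f(x) = (x / x_max)^alpha for co-occurrence counts below x_max, and 1 otherwise. A minimal Julia sketch of that function is given here for illustration; the name glove_weight is hypothetical and is not part of the Glowe API.

```julia
# Weighting function from the GloVe paper (illustrative helper, not part of Glowe).
# x is a co-occurrence count; the defaults mirror the x_max and alpha values in the log.
function glove_weight(x::Real; x_max::Real=10.0, alpha::Real=0.75)
    return x < x_max ? (x / x_max)^alpha : 1.0
end

glove_weight(1.0)     # a rare co-occurrence gets a weight below 1
glove_weight(250.0)   # counts at or above x_max are capped at 1.0
```

Capping the weight keeps very frequent word pairs from dominating the training cost, while the exponent reduces the influence of rare pairs.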

The model can be imported with

    model = wordvectors("text8-vec.txt", Float32, header=true, kind=:text)
    # WordVectors 71291 words, 300-element Float32 vectors
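
To make the header=true and kind=:text options more concrete, the sketch below parses a small text stream laid out the way such a file is assumed to be: a first line holding the vocabulary size and vector size, then one line per word followed by its vector components. The function read_text_vectors is a hypothetical helper written for this illustration; it is not how Glowe's wordvectors is implemented.

```julia
# Parse word vectors from a text stream with a "vocab_size vector_size" header line.
# The layout is an assumption for illustration; use Glowe's wordvectors for real files.
function read_text_vectors(io::IO, ::Type{T}=Float32) where {T<:AbstractFloat}
    header = split(readline(io))
    n, dim = parse(Int, header[1]), parse(Int, header[2])
    vocab = Vector{String}(undef, n)
    vectors = Matrix{T}(undef, dim, n)      # one column per word
    for j in 1:n
        fields = split(readline(io))
        vocab[j] = fields[1]
        vectors[:, j] = parse.(T, fields[2:end])
    end
    return vocab, vectors
end

# Tiny in-memory example standing in for a file such as text8-vec.txt.
io = IOBuffer("2 3\nbook 0.1 0.2 0.3\nnovel 0.4 0.5 0.6\n")
vocab, vectors = read_text_vectors(io)
```

Storing one word per column keeps each vector contiguous in memory, since Julia arrays are column-major.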

The vector representation of a word can be obtained using get_vector.

    julia> get_vector(model, "book")
    # 300-element Array{Float32,1}:
    # 0.006189716
    # 0.04822071
    # 0.017121462
    # ...

The words most similar to a given word, for example book, ranked by cosine similarity, can be obtained using cosine_similar_words.

    julia> cosine_similar_words(model, "book")
    # 10-element Array{String,1}:
    # "book"
    # "books"
    # "published"
    # "domesday"
    # "novel"
    # "comic"
    # "written"
    # "bible"
    # "urantia"
    # "work"
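
The ranking above can be illustrated with a small self-contained sketch that computes cosine similarities over a toy embedding matrix. The vocabulary, the vector values, and the function most_similar are made up for this example and are not part of Glowe; the package's own implementation may differ.

```julia
using LinearAlgebra

# Toy 3-dimensional embeddings, one column per word (made-up values).
vocab = ["book", "books", "novel", "car"]
vectors = Float32[0.9 0.8 0.7 0.1;
                  0.1 0.2 0.3 0.9;
                  0.2 0.1 0.3 0.0]

# Cosine similarity between two vectors.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Return the n words whose vectors are most similar to that of `word`.
function most_similar(vocab, vectors, word, n)
    idx = findfirst(==(word), vocab)
    idx === nothing && throw(ArgumentError("word not in vocabulary: $word"))
    query = vectors[:, idx]
    scores = [cosine(query, vectors[:, j]) for j in 1:length(vocab)]
    order = sortperm(scores, rev=true)          # highest similarity first
    return vocab[order[1:min(n, length(order))]]
end

most_similar(vocab, vectors, "book", 3)
```

With these toy values the call returns book, books and novel, in that order. The query word comes first because its similarity with itself is the highest possible, which matches the output shown above.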

Word vectors have many interesting properties. For example,
vector("king") - vector("man") + vector("woman") is close to vector("queen").

    julia> analogy_words(model, ["king", "woman"], ["man"])
    # 5-element Array{String,1}:
    # "queen"
    # "daughter"
    # "children"
    # "wife"
    # "son"
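
The vector arithmetic behind such an analogy query can be sketched with toy two-dimensional embeddings. The values and the function analogy are hypothetical and chosen so the arithmetic is easy to follow. Excluding the query words from the candidates is a common convention assumed here; it is not a statement about how analogy_words works internally.

```julia
using LinearAlgebra

# Toy embeddings (made-up values): first axis ~ royalty, second axis ~ gender.
embeddings = Dict(
    "king"  => [0.9, 0.9],
    "queen" => [0.9, 0.1],
    "man"   => [0.1, 0.9],
    "woman" => [0.1, 0.1],
    "apple" => [0.5, 0.5],
)

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Add the positive vectors, subtract the negative ones, and return the closest
# remaining word by cosine similarity (query words are excluded).
function analogy(embeddings, positive, negative)
    target = sum(embeddings[w] for w in positive) - sum(embeddings[w] for w in negative)
    candidates = [w for w in keys(embeddings) if !(w in positive) && !(w in negative)]
    scores = [cosine(target, embeddings[w]) for w in candidates]
    return candidates[argmax(scores)]
end

analogy(embeddings, ["king", "woman"], ["man"])
```

Here king + woman - man lands on the toy vector for queen, so the call returns "queen".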

License

This code is released under the MIT license and is therefore free to use.
GloVe itself is released under the Apache License v2.0.

References

[1] GloVe: Global Vectors for Word Representation

[2] Glove.jl - native Julia implementation

Acknowledgements

The design of the package relies on design concepts from the word2vec Julia interface, Word2Vec.jl.

Reporting Bugs

Please file an issue to report a bug or request a feature.