Wiki word vectors with gensim
curl -o data/zhwiki/zhwiki-latest-pages-articles.xml.bz2 https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
python 01.xml2string.py
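The script `01.xml2string.py` is not reproduced here. One common way to turn the dump into plain text is gensim's `WikiCorpus`, so the sketch below assumes that approach and gensim 4.x, where `get_texts()` yields lists of `str` tokens. The function names `write_articles` and `extract_dump` are illustrative and not taken from the repository.

```python
from typing import Iterable, List


def write_articles(articles: Iterable[List[str]], out_path: str) -> int:
    """Write one article per line, tokens separated by spaces.

    Returns the number of articles written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in articles:
            out.write(" ".join(tokens) + "\n")
            count += 1
    return count


def extract_dump(dump_path: str, out_path: str) -> int:
    """Stream articles out of a Wikipedia dump (assumes gensim is installed)."""
    # Imported lazily so that write_articles can be used without gensim.
    from gensim.corpora import WikiCorpus

    # dictionary={} skips the vocabulary-building pass, which is not needed
    # when the goal is only to dump the article text.
    wiki = WikiCorpus(dump_path, dictionary={})
    return write_articles(wiki.get_texts(), out_path)
```

Under these assumptions, `extract_dump("data/zhwiki/zhwiki-latest-pages-articles.xml.bz2", "data/zhwiki/zhwiki_raw.txt")` would produce the `zhwiki_raw.txt` file that the OpenCC step reads. Streaming the articles keeps memory use flat, which matters for a dump of this size.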
cd /data/zhwiki/
opencc -i zhwiki_raw.txt -o zhwiki_t2s.txt -c t2s.json
python 02.word_segment.py
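Word2Vec expects whitespace-separated tokens, and Chinese text has no spaces between words, so a segmentation pass is needed before training. The sketch below assumes `02.word_segment.py` does something along these lines with `jieba`; the function name `segment_file` and the injectable `cut` parameter are illustrative choices, not the repository's code.

```python
from typing import Callable, Iterable, Optional


def segment_file(
    src_path: str,
    dst_path: str,
    cut: Optional[Callable[[str], Iterable[str]]] = None,
) -> int:
    """Segment each line of src_path and write space-joined words to dst_path.

    Returns the number of non-empty lines written.
    """
    if cut is None:
        # Imported lazily; assumes the jieba package is installed.
        import jieba

        cut = jieba.cut

    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # jieba returns whitespace as tokens too, so drop blank ones.
            words = [w for w in cut(line.strip()) if w.strip()]
            if words:
                dst.write(" ".join(words) + "\n")
                written += 1
    return written
```

A call such as `segment_file("zhwiki_t2s.txt", "zhwiki_seg_t2s.txt")` would segment the simplified-Chinese text with jieba's default mode; the output file name here is hypothetical. Passing a different `cut` callable makes it easy to swap in another segmenter.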
python 03.word2vec.py
python 04.word_similarity.py
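The training and similarity steps can be sketched as follows. This is a minimal illustration that assumes gensim 4.x (where the dimension parameter is `vector_size`; it was `size` before 4.0) and a corpus file with one segmented article per line. The function names and hyperparameter values are assumptions, not the settings used by `03.word2vec.py` or `04.word_similarity.py`.

```python
import multiprocessing
from typing import List, Tuple


def train_word2vec(
    corpus_path: str,
    model_path: str,
    vector_size: int = 100,
    min_count: int = 5,
):
    """Train a Word2Vec model on a whitespace-tokenized corpus and save it."""
    # Imported lazily; assumes gensim 4.x is installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        sentences=LineSentence(corpus_path),  # streams one line at a time
        vector_size=vector_size,
        window=5,
        min_count=min_count,  # ignore words rarer than this
        workers=multiprocessing.cpu_count(),
    )
    model.save(model_path)
    return model


def similar_words(model_path: str, word: str, topn: int = 10) -> List[Tuple[str, float]]:
    """Return the topn nearest words by cosine similarity, or [] if unknown."""
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)
    if word not in model.wv:
        return []
    return model.wv.most_similar(word, topn=topn)
```

For example, after `train_word2vec("zhwiki_seg_t2s.txt", "embedding_model_t2s/zhwiki_embedding_t2s.model")`, a call to `similar_words("embedding_model_t2s/zhwiki_embedding_t2s.model", "数学")` would return a list of `(word, similarity)` pairs. The corpus file name is hypothetical; the model path matches the one passed to the visualizer.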
Generate the visualization
python w2v_visualizer.py embedding_model_t2s/zhwiki_embedding_t2s.model visualize_result
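`w2v_visualizer.py` writes a TensorBoard log directory. As a lighter-weight alternative (a different technique, not what the script does), the vectors can be exported as two TSV files and loaded manually into the Embedding Projector. The sketch below uses only the standard library; the function name `export_for_projector` is hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def export_for_projector(
    pairs: Iterable[Tuple[str, Sequence[float]]],
    vectors_path: str,
    metadata_path: str,
) -> int:
    """Write (word, vector) pairs as projector-style TSV files.

    vectors_path gets one tab-separated vector per line; metadata_path gets
    the matching word on the same line number (single column, no header).
    Returns the number of words written.
    """
    count = 0
    with open(vectors_path, "w", encoding="utf-8") as vec_out, \
            open(metadata_path, "w", encoding="utf-8") as meta_out:
        for word, vector in pairs:
            vec_out.write("\t".join(f"{x:.6f}" for x in vector) + "\n")
            meta_out.write(word + "\n")
            count += 1
    return count
```

With a gensim 4.x model loaded as `model`, the pairs could be produced with `((w, model.wv[w]) for w in model.wv.index_to_key[:10000])`, which keeps the 10,000 most frequent words so the files stay small.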
Output
Using TensorFlow backend.
2017-09-02 23:15:04.010950: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 23:15:04.010973: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 23:15:04.010978: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 23:15:04.010982: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Run `tensorboard --logdir=visualize_result` to run visualize result on tensorboard
Run the visualization
tensorboard --logdir=visualize_result