My analysis of Google's NNLM-ZH model on TensorFlow Hub.
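`<S>`, `<UNK>`, `</S>`
The model is loaded in the form of a `hub.KerasLayer`.
`tf.saved_model` binds the vocabulary to the weights.
The weights are stored in a `.npy` file and placed into a Keras Embedding layer using `tf.keras.initializers.Constant`.
The model's behavior is reproduced using `tf.nn.embedding_lookup_sparse` with the sqrtn combiner.
The contents of the `NNLM-ZH` folder were downloaded from nnlm-zh-dim128-with-normalization.
The ZhConverter package used in `convert_s2t.py` is a simple Simplified/Traditional Chinese conversion system built using MediaWiki.
Words such as `G##` appear. Although `#` most likely stands for a digit, this has not yet been fully investigated.
Attempts to reproduce the case of a single unknown word using `'<UNK>'` or `'<S> <UNK> </S>'` did not succeed.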
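
To make the sqrtn combiner concrete, here is a minimal pure-Python sketch of the arithmetic that `tf.nn.embedding_lookup_sparse` performs with `combiner="sqrtn"`: a weighted sum of the looked-up rows, divided by the square root of the sum of squared weights. The vocabulary, the 3-dimensional vectors, the whitespace tokenization, and the helper names `sqrtn_combine` and `embed_sentence` are all assumptions made for illustration; they are not the real NNLM-ZH vocabulary, weights, or code from this repository.

```python
import math

# Hypothetical toy vocabulary and 3-dimensional vectors, for illustration only;
# these are NOT the real NNLM-ZH vocabulary or weights.
vocab = {"<S>": 0, "<UNK>": 1, "</S>": 2, "你好": 3, "世界": 4}
embeddings = [
    [0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
]


def sqrtn_combine(token_ids, weights=None):
    """Weighted sum of embedding rows divided by sqrt(sum of squared weights)."""
    if not token_ids:
        raise ValueError("need at least one token id")
    if weights is None:
        # With unit weights the divisor is simply sqrt(number of tokens).
        weights = [1.0] * len(token_ids)
    dim = len(embeddings[0])
    total = [0.0] * dim
    for token_id, weight in zip(token_ids, weights):
        for d in range(dim):
            total[d] += weight * embeddings[token_id][d]
    norm = math.sqrt(sum(w * w for w in weights))
    return [value / norm for value in total]


def embed_sentence(sentence):
    """Split on whitespace (an assumption), look up each token, combine with sqrtn."""
    token_ids = [vocab[token] for token in sentence.split()]
    return sqrtn_combine(token_ids)


sentence_vector = embed_sentence("你好 世界")
```

The sketch deliberately raises a `KeyError` for tokens outside the toy vocabulary instead of mapping them to `<UNK>`, since how the real model treats unknown words is exactly what the experiment above could not reproduce.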