Project author: clwatkins

Project description:
Doc2Vec implementation using JAX and Haiku
Primary language: Python
Project URL: git://github.com/clwatkins/doc2vec-jax.git
Created: 2021-02-09T08:08:38Z
Project community: https://github.com/clwatkins/doc2vec-jax


Doc2Vec JAX

This is a JAX-based implementation of Le and Mikolov’s Doc2Vec algorithm, which
builds on Word2Vec to generate document-level (bag of words) representations.

The paper proposes two main model variants, Paragraph Vector-Distributed Memory (PV-DM) and Distributed Bag of Words (DBOW), and reports that PV-DM performed better in most situations.

PV-DM architecture:

[Figure: Doc2Vec PV-DM architecture diagram]

DBOW architecture:

[Figure: Doc2Vec DBOW architecture diagram]
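
To make the difference between the two variants concrete, here is a minimal sketch in plain JAX. It is not this repository's code: the project uses Haiku, and every name below (`init_params`, `pvdm_logits`, `dbow_logits`, `pvdm_loss`, the parameter keys) is hypothetical. It also assumes averaging rather than concatenation to combine the vectors in PV-DM, and a full softmax over the vocabulary rather than hierarchical softmax or negative sampling.

```python
import jax
import jax.numpy as jnp


def init_params(key, num_docs, vocab_size, embed_dim):
    """Randomly initialise document vectors, word vectors and the output layer."""
    k_doc, k_word, k_out = jax.random.split(key, 3)
    return {
        "doc_embed": 0.1 * jax.random.normal(k_doc, (num_docs, embed_dim)),
        "word_embed": 0.1 * jax.random.normal(k_word, (vocab_size, embed_dim)),
        "out_w": 0.1 * jax.random.normal(k_out, (embed_dim, vocab_size)),
        "out_b": jnp.zeros((vocab_size,)),
    }


def pvdm_logits(params, doc_ids, context_ids):
    """PV-DM: combine the document vector with the context word vectors."""
    doc_vec = params["doc_embed"][doc_ids]        # (batch, dim)
    ctx_vecs = params["word_embed"][context_ids]  # (batch, window, dim)
    stacked = jnp.concatenate([doc_vec[:, None, :], ctx_vecs], axis=1)
    hidden = stacked.mean(axis=1)                 # averaging is an assumption
    return hidden @ params["out_w"] + params["out_b"]  # (batch, vocab)


def dbow_logits(params, doc_ids):
    """DBOW: predict a word from the document using only the document vector."""
    return params["doc_embed"][doc_ids] @ params["out_w"] + params["out_b"]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target word ids."""
    log_probs = jax.nn.log_softmax(logits)
    return -jnp.take_along_axis(log_probs, targets[:, None], axis=1).mean()


def pvdm_loss(params, doc_ids, context_ids, targets):
    return cross_entropy(pvdm_logits(params, doc_ids, context_ids), targets)


# Toy batch: 2 examples, 4 context words each, vocabulary of 10 words.
params = init_params(jax.random.PRNGKey(0), num_docs=4, vocab_size=10, embed_dim=8)
doc_ids = jnp.array([0, 1])
context_ids = jnp.array([[2, 3, 5, 6], [1, 4, 7, 8]])
targets = jnp.array([4, 5])

loss, grads = jax.value_and_grad(pvdm_loss)(params, doc_ids, context_ids, targets)
```

In both variants the document vectors are ordinary trainable parameters, so `grads["doc_embed"]` is only non-zero for the documents that appear in the batch.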

Codebase

  • doc2vec contains the core implementation, including code to prepare documents for training
  • experiments contains key experiments described in the original paper, reimplemented here
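
As an illustration of what preparing documents for training can involve, the sketch below turns raw text into (document id, context word ids, target word id) triples with a sliding window. This is not the code in doc2vec: `build_vocab` and `training_examples` are hypothetical helpers, whitespace tokenisation is a simplification, and the symmetric window is an assumption.

```python
from typing import Dict, Iterator, List, Tuple


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign each lower-cased, whitespace-separated token an integer id."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def training_examples(
    docs: List[List[int]], window: int = 2
) -> Iterator[Tuple[int, List[int], int]]:
    """Yield (doc_id, context_ids, target_id) for every fully contained window."""
    for doc_id, tokens in enumerate(docs):
        for centre in range(window, len(tokens) - window):
            before = tokens[centre - window:centre]
            after = tokens[centre + 1:centre + window + 1]
            yield doc_id, before + after, tokens[centre]


texts = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(texts)
docs = [[vocab[token] for token in text.lower().split()] for text in texts]
examples = list(training_examples(docs, window=2))
```

Documents shorter than `2 * window + 1` tokens produce no examples in this sketch; a fuller implementation would likely pad them instead.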

Installation

Via Docker

Prerequisite: ensure nvidia-docker is installed, since step 2 passes --gpus all to the container.

  1. $ sudo docker build -t doc2vec-jax .
  2. $ sudo docker run --gpus all -it -v ~/doc2vec:/doc2vec doc2vec-jax:latest /bin/bash
  3. $ poetry shell
  4. $ python -m doc2vec.train ...

TODO

  • PVDM model variant
  • DBOW model variant
  • Negative sampling
  • Sub-sampling
  • Parallelise training data generation