Nonparametric topic modeling via Hierarchical Dirichlet Processes
Authors: Eduardo Coronado and Andrew Carr (Duke University)
The hdp package provides tools to set up and train a Hierarchical Dirichlet Process (HDP) for topic modeling. This is similar to a Latent Dirichlet Allocation (LDA) model, with one major difference: HDPs are nonparametric, in that the number of topics is learned from the data rather than specified by the user in advance.
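For background, here is the standard two-level HDP generative process (the Teh et al., 2006 formulation, not anything specific to this package); the gamma and alpha hyperparameters that appear in the inference example below play exactly these roles:

\begin{aligned}
G_0 \mid \gamma, H &\sim \mathrm{DP}(\gamma, H) && \text{(global topic distribution)} \\
G_j \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0) && \text{(per-document mixture, for each document } j\text{)} \\
\theta_{ji} \mid G_j &\sim G_j && \text{(topic of word } i \text{ in document } j\text{)} \\
w_{ji} \mid \theta_{ji} &\sim \mathrm{Multinomial}(\theta_{ji}) && \text{(observed word)}
\end{aligned}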
This package has the following dependencies:
Pybind11 = 2.5
Eigen (C++ linear algebra library)
To install the Python dependencies, open a terminal and run the following command:
pip3 install -r requirements.txt
To install the external C++ libraries, run the following command in the terminal (note: it is important that you do so from the hdp root folder).
python3 ./src/clone_extensions.py
The script won't install anything if everything is already up to date.
Once the dependencies are installed, run the following commands in the terminal:
chmod 755 ./INSTALL.sh
./INSTALL.sh
or run the following commands individually:
python3 setup.py build
python3 setup.py install
Let's get started by importing the package and its main functions, run_preprocess and run_hdp. The former provides an API to preprocess text data in CSV format, while the latter computes the inference.
import hdp
from hdp.text_prep import run_preprocess
from hdp.HDP import run_hdp
You can test these functions with some test data in the ./data folder. For example, to preprocess files and obtain the corpus vocabulary and an inference-ready document structure, use the following commands:
url = './data/tm_test_data.csv'
vocab, docs = run_preprocess(url)
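As a quick sanity check on the outputs (a minimal sketch; the exact values depend on your CSV):

print(len(vocab))    # number of unique words in the corpus vocabulary
print(len(docs))     # number of documents
print(docs[0][:10])  # first few entries of the first preprocessed document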
Subsequently, you can generate inferences using the run_hdp function with the following commands:
import numpy as np
# number of iterations
it = 5
# Hyperparameters (user-defined)
beta = 0.5 # topic concentration (LDA), can be user-defined
alpha = np.random.gamma(1,1) # DP mixture hyperparam (or user-defined float > 0)
gamma = np.random.gamma(1,1) # Base DP hyperparam (or user-defined float >0)
doc_arrays, topic_idx, n_kv, m_k, perplex = run_hdp(docs, vocab, gamma, alpha, beta, epochs=it)
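To see what the model found, here is a sketch of pulling the top words per topic from the returned counts. This is not part of the package API; it assumes, per the API section below, that the columns of n_kv are indexed by topic (so topic_idx selects the relevant columns) and that its rows line up with vocab.

active = list(topic_idx)   # topics still active after inference
counts = n_kv[:, active]   # keep only columns of active topics
vocab_arr = np.asarray(vocab)

# Ten highest-count words for each active topic
for j, k in enumerate(active):
    top = vocab_arr[np.argsort(counts[:, j])[::-1][:10]]
    print('Topic', k, ':', ', '.join(top))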
For more information on how to run these functions, see the API section below or visit the ./report folder, which provides context and theory behind the implementation.
run_preprocess(url)
Preprocesses text data in CSV format and returns the corpus vocabulary along with an inference-ready document structure.
Parameters
url: str, path to the CSV file containing the text data
Returns
vocab: array, the corpus vocabulary
docs: list(sub_list), len = num of docs, preprocessed documents ready to pass to run_hdp
run_hdp(docs, vocab, gamma, alpha, beta, epochs=1)
Computes inference on a document corpus and vocabulary over a user-defined number of epochs (iterations). Additionally, the user must provide prior distribution hyperparameters similar to those needed in LDA.
Parameters
docs: list(sub_list), len = num of docs, preprocessed documents from run_preprocess
vocab: array, corpus vocabulary from run_preprocess
gamma: float, base DP hyperparameter (> 0)
alpha: float, DP mixture hyperparameter (> 0)
beta: float, topic concentration, as in LDA
epochs: int (default = 1), number of inference iterations
Returns
doc_arrays: list(Dict), len = num of docs, per-document inference state, where each Dict contains:
t_j: array of table indexes in the document
k_jt: array of topics assigned to the tables in t_j
n_jt: count array of words assigned to each table/topic
topic_idx: list, indexes of the active topics found during inference
n_kv: ndarray, word counts per topic; not all of its columns correspond to active topics during inference, so use the topic_idx to select the appropriate cols
m_k: array, number of tables assigned to each topic
perplex: list, perplexity values computed during inference
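To make the return structure concrete, here is a short sketch. It assumes the Dicts in doc_arrays expose the field names above as keys, and that perplex holds one value per epoch; both are assumptions about this package's internals.

d0 = doc_arrays[0]
print(d0['t_j'])   # table indexes used in the first document (assumed key)
print(d0['k_jt'])  # topic assigned to each of those tables (assumed key)
print(d0['n_jt'])  # word counts at each table (assumed key)

for epoch, p in enumerate(perplex):  # monitor convergence across epochs
    print('epoch', epoch, 'perplexity', p)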
Unit tests can be run using the following command in the terminal, from the package root folder:
python3 setup.py test