Implementing Stick-Breaking Variational Auto-encoder using Python-2 and Theano
This repository contains Theano implementations of the models described in Deep Generative Models with Stick-Breaking Priors. Documentation and development are still in progress.
The Stick-Breaking Autoencoder is a nonparametric reformulation of the Variational Autoencoder. The latent variables are drawn from a stick-breaking process, a combinatorial mechanism for sampling from an infinite distribution. This implementation uses a truncated variational approximation; see the paper for a discussion of un-truncated approaches. The model's feedforward architecture can be seen below. Note the cross-dependencies that are a byproduct of the stick-breaking process's recursive nature.
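As a rough illustration of the truncated stick-breaking construction itself (not of the network architecture), the sketch below draws stick fractions from Kumaraswamy distributions by inverse-CDF sampling and turns them into mixture weights. It is plain Python rather than the repository's Theano code, and the function names, the truncation level of 10, and the parameter values are assumptions made for the example.

```python
import random


def sample_kumaraswamy(a, b, rng):
    """Inverse-CDF sample from Kumaraswamy(a, b): v = (1 - u^(1/b))^(1/a)."""
    u = rng.random()
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)


def stick_breaking(fractions):
    """Map K-1 stick fractions in [0, 1] to K weights that sum to one."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for v in fractions:
        weights.append(v * remaining)  # break off a fraction v of what is left
        remaining *= 1.0 - v
    weights.append(remaining)  # truncation: the last weight takes the remainder
    return weights


rng = random.Random(0)
truncation = 10  # hypothetical truncation level
fractions = [sample_kumaraswamy(1.0, 5.0, rng) for _ in range(truncation - 1)]
weights = stick_breaking(fractions)
print(weights)
```

Each weight depends on every fraction drawn before it, which is where the cross-dependencies in the architecture come from.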
The Stick-Breaking Autoencoder can be trained on MNIST by running:
Be sure to set the Theano flags appropriately for GPU usage. Running with the --help option shows the command-line arguments for changing the dataset, architecture, and other hyperparameters.
The Gamma Stick-Breaking Autoencoder can be trained on MNIST by running:
The Gauss Logit Stick-Breaking Autoencoder can be trained on MNIST by running:
The Gauss Variational Autoencoder can be trained on MNIST by running:
Fully- and semi-supervised classification experiments can be run with the semi-supervised deep generative model with stick-breaking latent variables, a nonparametric reformulation of Kingma et al.'s M2 model. This model is similar to the variational autoencoder; the difference is that a class label is introduced as another latent variable, which is marginalized out when unobserved. The model's feedforward architecture is diagrammed below.
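To make "marginalized when unobserved" concrete, here is a minimal sketch of the objective for an unlabeled example in an M2-style model: the per-class lower bounds are averaged under the classifier's predicted distribution and the entropy of that distribution is added. The function name and the toy numbers are hypothetical, and the repository's Theano code may organize this computation differently.

```python
import math


def unlabeled_bound(class_probs, labeled_bounds):
    """Lower bound for an unlabeled example in an M2-style model.

    class_probs[y]    -- q(y | x), the classifier's probability for class y
    labeled_bounds[y] -- lower bound on log p(x, y) if y were the observed label
    """
    expected = sum(p * l for p, l in zip(class_probs, labeled_bounds))
    entropy = -sum(p * math.log(p) for p in class_probs if p > 0.0)
    return expected + entropy


# Toy example with three classes (numbers are made up for illustration).
probs = [0.7, 0.2, 0.1]
bounds = [-95.0, -110.0, -120.0]
print(unlabeled_bound(probs, bounds))
```

When the classifier is certain of one class, the entropy term vanishes and the result reduces to the labeled bound for that class.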
The Stick Breaking Semi-Supervised Deep Generative Model can be trained on MNIST by running:
Again, be sure to set the Theano flags appropriately for GPU usage, and use the --help option to see command line arguments.
The Gauss Semi-Supervised Deep Generative Model can be trained on MNIST by running:
The KNN Semi-Supervised model can be trained on MNIST by running:
There are a few issues to be aware of when training these stick-breaking models. The first is that one term in the KL divergence between the prior Beta distributions and the posterior Kumaraswamy distributions needs a Taylor series approximation. See the supplemental materials for the derivation. The code computes this approximation with the leading ten terms (hard-coded).
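The effect of truncating the series can be checked numerically. The sketch below assumes the approximated term is the infinite sum over m of B(m/a, b) / (m + ab), which arises from expanding log(1 - v) under a Kumaraswamy(a, b) distribution; the helper names are made up for this example, and only the default of ten terms comes from the text above. For a = 1 the Kumaraswamy reduces to a Beta(1, b), so the infinite sum has the closed form 1/b^2 to compare against.

```python
import math


def beta_fn(x, y):
    """Beta function B(x, y), computed through log-gamma for stability."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))


def series_term(a, b, n_terms=10):
    """Partial sum of sum_{m >= 1} B(m / a, b) / (m + a * b)."""
    return sum(beta_fn(m / a, b) / (m + a * b) for m in range(1, n_terms + 1))


a, b = 1.0, 2.0
exact = 1.0 / b ** 2  # closed form, valid only for a = 1
for n in (1, 10, 100):
    approx = series_term(a, b, n)
    print(n, approx, exact - approx)  # the partial sum underestimates the limit
```

Every term of the sum is positive, so a truncated sum always falls short of the true value; the shortfall shrinks as more terms are kept.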
The second issue is that calculating the KL divergence involves the digamma function. The problem is that its derivative, the polygamma function, is not easy to implement in C, which is necessary for using the GPU. I am aware of only one C implementation capable of computing the polygamma function, and I haven't had the time to add the functionality to the Theano code base. As a result, this code employs another Taylor series approximation, expanded around zero, to compute the digamma function. For large values of b (the Kumaraswamy parameter), the approximation becomes poor, but this should not be an issue if the prior's concentration parameter (the Beta's beta value) is set to a reasonably small value (< 15). Due to these two approximations, it's possible (but rare) to have a negative KL divergence.
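The following sketch shows how a truncated series can push the KL divergence slightly below zero. It assumes the closed form of KL(Kumaraswamy(a, b) || Beta(alpha, beta)) as I understand it from the paper, and it is not the repository's code: the function names are hypothetical, and the digamma here uses a recurrence plus an asymptotic expansion instead of the Taylor expansion around zero described above.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma(x):
    """Digamma for x > 0 (recurrence + asymptotic series; a stand-in only)."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    tail = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - tail


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def kl_kumaraswamy_beta(a, b, alpha, beta, n_terms=10):
    """KL(Kumaraswamy(a, b) || Beta(alpha, beta)) with a truncated series."""
    series = sum(math.exp(log_beta(m / a, b)) / (m + a * b)
                 for m in range(1, n_terms + 1))
    kl = (a - alpha) / a * (-EULER_GAMMA - digamma(b) - 1.0 / b)
    kl += math.log(a * b) + log_beta(alpha, beta)
    kl -= (b - 1.0) / b
    kl += (beta - 1.0) * b * series
    return kl


# Kumaraswamy(1, 2) and Beta(1, 2) are the same distribution, so the true KL is 0.
print(kl_kumaraswamy_beta(1.0, 2.0, 1.0, 2.0))        # slightly below zero
print(kl_kumaraswamy_beta(1.0, 2.0, 1.0, 2.0, 2000))  # much closer to zero
```

With ten terms the truncated series underestimates a positive contribution, which is enough to make the result negative when the true divergence is zero or very small.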
Lastly, if the posterior's truncation level is set too low, sampling the Kumaraswamy variables can cause NaNs. This occurs because the model tries to perform a hard thresholding of the latent representation, and to do this, the variational Kumaraswamy parameters must be set to very small or very large values. Very large values cause the Taylor approximations to become inaccurate, and very small values can cause the 1/a and 1/b terms to go to infinity. If NaNs are encountered, increasing the truncation level or clipping the parameters of the variational Kumaraswamy distributions usually solves the problem.
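A minimal sketch of the clipping remedy, in plain Python rather than Theano: the parameters are forced into a fixed range before the inverse-CDF sample is computed, so that 1/a and 1/b stay bounded. The bounds and function names below are hypothetical; suitable values depend on the model and are not taken from this repository.

```python
PARAM_MIN = 1e-2  # hypothetical lower bound for a and b
PARAM_MAX = 1e2   # hypothetical upper bound for a and b


def clip(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def sample_kumaraswamy_clipped(a, b, u, lo=PARAM_MIN, hi=PARAM_MAX):
    """Inverse-CDF Kumaraswamy sample from uniform noise u, with clipped a, b."""
    a = clip(a, lo, hi)
    b = clip(b, lo, hi)
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)


# Extreme parameters are pulled back into range before the exponents are formed.
v = sample_kumaraswamy_clipped(1e-300, 1e300, 0.5)
print(v)
```

Clipping changes the distribution being sampled whenever a bound is active, so the range is a trade-off between numerical stability and how sharply the model can threshold its latent representation.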