Central repository with pretrained models for transfer learning, BPE subword tokenization, mono/multilingual embeddings, and everything in between.
This repository contains download links for pretrained models for a variety of tasks, along with instructions on how to load, modify, and use them. It is updated whenever fresher models finish training, but do be patient as networks train very slowly! If you have any questions, recommendations, comments, or model requests, be sure to drop by our issues tracker.
To keep lab intellectual property intact, we’re currently employing a request form for pretrained models. Send us a request using this link if you need a model.
We provide pretrained BERT (Devlin, et al., 2019) models in Filipino, in both cased and uncased form. We only provide base models (12 layers, 12 heads, 768 units), as the large models are unwieldy and very expensive to train (the latest one trained on a Google Cloud TPU for two weeks and still lagged behind the base models in performance).
The pretrained models are in TensorFlow’s checkpoint format, so they are compatible with the TPU training code (CPU and GPU usage have not been verified at this point, so we’d have to stick with TPUs). We can convert them to PyTorch checkpoints for use with GPUs. A modified copy of HuggingFace's BERT implementation is included in the repository, so be sure to check the installation instructions there. We have also modified this repository to allow classification tasks other than GLUE and SQuAD.
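For illustration, here is a minimal sketch of loading a checkpoint once it has been converted to PyTorch, using the stock Hugging Face transformers library rather than the modified copy bundled with this repo; the model directory name is a placeholder for wherever you unpack the converted files.

```python
import torch
from transformers import BertModel, BertTokenizer

model_dir = "path/to/bert-tagalog-base-cased"  # hypothetical local path to the converted checkpoint

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("Magandang umaga po!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for a base model
```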
Requirements
We provide a pretrained AWD-LSTM (Merity, et al., 2017) model in Filipino that can be finetuned into a classifier using ULMFiT (Howard & Ruder, 2018). We only provide one model. Do note that we still cannot release our training, finetuning, and scaffolding code, as the work related to it is under review at a conference. We’ll update this repo as soon as the anonymity period ends!
While we use our own handwritten scaffolding, we have done extra work to ensure that our pretrained checkpoints are compatible with the FastAI library (yay!), which to this date is still the only reliable implementation of ULMFiT. We’ll add our own finetuning code (standalone, no extra packages needed) to this repository as soon as we can. To use the model, you can follow the instructions in the FastAI repository.
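For reference, here is a minimal sketch of plugging the checkpoint into FastAI, assuming fastai v1; the corpus, weight, and vocab filenames below are placeholders rather than the actual release names, and the pretrained files are expected under a local `models/` folder.

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Hypothetical dataframe with a "text" column holding your Filipino corpus
df = pd.read_csv("train.csv")

data_lm = TextLMDataBunch.from_df(".", train_df=df, valid_df=df, text_cols="text")

learn = language_model_learner(
    data_lm,
    AWD_LSTM,
    drop_mult=0.3,
    # Placeholder names: the weights (.pth) and vocab (.pkl) files, without
    # extensions, expected under ./models/
    pretrained_fnames=["filipino_awd_lstm", "filipino_itos"],
)
learn.fit_one_cycle(1, 1e-2)  # finetune the language model before building a classifier
```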
Requirements
This is currently a work in progress. We will soon release a pretrained GPT-2 (Radford, et al., 2019) model in Filipino for you to use. We do, however, have demo code in this repo for finetuning a pretrained Transformer for classification.
Coming soon.
We provide a pretrained SentencePiece model for BPE subword tokenization (Sennrich, et al., 2016). Most modern translation models now use subword tokenization.
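For illustration, a minimal sketch of tokenizing text with the sentencepiece Python package; the model filename is a placeholder for the downloaded file.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("filipino_bpe.model")  # placeholder path to the pretrained BPE model

sentence = "Magandang umaga sa inyong lahat."
pieces = sp.encode_as_pieces(sentence)  # subword pieces (word starts are marked with '▁')
ids = sp.encode_as_ids(sentence)        # the same tokens as vocabulary ids

print(pieces, ids)
print(sp.decode_pieces(pieces))         # round-trips back to the original text
```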
Requirements
We provide monolingual embeddings in both GloVe and fastText formats.
We provide multilingual and cross-lingual embeddings in the fastText format aligned via MUSE (Lample, et al., 2018), as well as models aligned with other vector-alignment techniques.
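For illustration, a minimal sketch of loading the vectors with gensim; the filenames and the query word below are placeholders, not the actual release names.

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# fastText .vec files (including the MUSE-aligned ones) are already in word2vec text format
ft = KeyedVectors.load_word2vec_format("filipino_fasttext.vec")
print(ft.most_similar("maganda", topn=5))  # nearest neighbours of an example word

# GloVe files first need a header line added so gensim can read them
glove2word2vec("filipino_glove.txt", "filipino_glove.w2v.txt")
glove = KeyedVectors.load_word2vec_format("filipino_glove.w2v.txt")
```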
We’ll share our large-scale dumps of text corpora for training language models as soon as the anonymity period of the conference our work is under review at ends. Stay tuned!
We would like to thank the TensorFlow Research Cloud initiative for making TPUs more accessible, allowing us to perform benchmarks on BERT models in Philippine languages. If you have any comments or concerns, be sure to drop by our issues tracker!
This repository is managed by the De La Salle University Machine Learning Group.