A PyTorch implementation of the paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
A trained model can be loaded into the decoder for caption generation; pass its checkpoint path to `generate_caption.py` with `--model` (see the command below).
BLEU scores for VGG19 (Orange) and ResNet152 (Red), trained with teacher forcing.

BLEU Score | Graph | Top-K Accuracy | Graph |
---|---|---|---|
BLEU-1 | ![]() | Training Top-1 | ![]() |
BLEU-2 | ![]() | Training Top-5 | ![]() |
BLEU-3 | ![]() | Validation Top-1 | ![]() |
BLEU-4 | ![]() | Validation Top-5 | ![]() |
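For reference, BLEU-1 through BLEU-4 are corpus-level n-gram precision scores. A minimal sketch of how such scores can be computed with NLTK is below; the tokenized captions are placeholder data and this is not necessarily the evaluation code used in this repository.

```python
from nltk.translate.bleu_score import corpus_bleu

# Placeholder data: one generated caption and its reference captions,
# both tokenized into word lists.
references = [[["a", "dog", "runs", "on", "the", "grass"],
               ["a", "dog", "running", "through", "a", "field"]]]
hypotheses = [["a", "dog", "runs", "on", "grass"]]

# The weights select the n-gram orders: BLEU-1 uses unigrams only,
# BLEU-4 averages unigram through 4-gram precision.
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```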
This project was written in Python 3, so it may not work with Python 2.

Download the COCO training and validation images and put them in `data/coco/imgs/train2014` and `data/coco/imgs/val2014` respectively. Put the COCO dataset split JSON file from Deep Visual-Semantic Alignments in `data/coco/`; it should be named `dataset.json`.
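After placing `dataset.json`, a quick sanity check can confirm the splits look right. The field names below assume the standard Karpathy-split format from Deep Visual-Semantic Alignments; this snippet is not part of the repository.

```python
import json
from collections import Counter

# Load the split file placed in data/coco/ (field names assume the
# standard "Deep Visual-Semantic Alignments" split format).
with open("data/coco/dataset.json") as f:
    data = json.load(f)

# Count the images assigned to each split and peek at one caption.
print(Counter(img["split"] for img in data["images"]))

first = data["images"][0]
print(first["filename"], "->", first["sentences"][0]["raw"])
```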
Run the preprocessing to create the needed JSON files:
python generate_json_data.py
Start the training by running:
python train.py
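As noted above, the reported models were trained with teacher forcing: at each time step the decoder receives the ground-truth previous word rather than its own prediction. A minimal sketch of that idea is below; the decoder interface (`init_hidden`, the `(word, hidden, features)` call) is hypothetical and not the actual classes in `train.py`.

```python
import torch
import torch.nn as nn

def teacher_forced_loss(decoder: nn.Module, features: torch.Tensor,
                        captions: torch.Tensor, criterion: nn.Module):
    """One teacher-forced pass over a batch (illustrative only).

    features: (batch, num_regions, feat_dim) encoder output
    captions: (batch, max_len) ground-truth token ids, starting with <start>
    """
    _, max_len = captions.shape
    hidden = decoder.init_hidden(features)   # hypothetical helper
    loss = torch.zeros(())
    for t in range(max_len - 1):
        # Teacher forcing: feed the ground-truth word at step t,
        # not the word the decoder predicted at step t - 1.
        logits, hidden, _alpha = decoder(captions[:, t], hidden, features)
        loss = loss + criterion(logits, captions[:, t + 1])
    return loss / (max_len - 1)
```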
The models will be saved in `model/` and the training statistics in `runs/`. To see the training statistics, use:
tensorboard --logdir runs
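The files in `runs/` are standard TensorBoard event files. A short sketch of how such scalars are typically written with `torch.utils.tensorboard` follows; the tag name and values are placeholders, not the repository's actual logging code.

```python
from torch.utils.tensorboard import SummaryWriter

# Write to the same directory that `tensorboard --logdir runs` reads.
writer = SummaryWriter("runs/example")

for step in range(100):
    placeholder_loss = 1.0 / (step + 1)  # stand-in for a real training loss
    writer.add_scalar("train/loss", placeholder_loss, step)

writer.close()
```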
To generate a caption for an image, run:

python generate_caption.py --img-path <PATH_TO_IMG> --model <PATH_TO_MODEL_PARAMETERS>
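At a high level, caption generation loads the saved decoder parameters, encodes the image with the CNN, and decodes word by word. A rough sketch of a greedy decoding loop is below; the preprocessing is the standard ImageNet recipe, while the encoder/decoder interfaces and the `word_map` vocabulary are illustrative rather than the script's actual code.

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet preprocessing for a VGG19/ResNet152 encoder.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def generate_caption(encoder, decoder, img_path, word_map, max_len=20):
    """Greedy decoding sketch; the actual script may use beam search."""
    img = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
    words = []
    with torch.no_grad():
        features = encoder(img)                  # (1, num_regions, feat_dim)
        hidden = decoder.init_hidden(features)   # hypothetical helper
        word = torch.tensor([word_map["<start>"]])
        for _ in range(max_len):
            logits, hidden, _alpha = decoder(word, hidden, features)
            word = logits.argmax(dim=-1)         # greedy: most likely next word
            if word.item() == word_map["<end>"]:
                break
            words.append(word.item())
    return words
```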
Original Theano Implementation
Neural Machine Translation By Jointly Learning to Align And Translate
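The second reference above is the source of the additive (soft) attention the decoder uses: a small network scores every encoder region against the current hidden state, and the softmax of those scores weights the region features into a context vector. A minimal PyTorch sketch of that mechanism (dimensions and names are illustrative, not the repository's classes):

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive attention over encoder regions (illustrative sketch)."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(features)
                            + self.hidden_proj(hidden).unsqueeze(1))
        scores = self.score(energy).squeeze(-1)      # (batch, num_regions)
        alpha = torch.softmax(scores, dim=1)         # attention weights
        context = (features * alpha.unsqueeze(-1)).sum(dim=1)  # weighted sum
        return context, alpha
```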