Project author: jayleicn

Project description:
[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on the TVCaption dataset

Primary language: Python
Project address: git://github.com/jayleicn/TVCaption.git
Created: 2020-01-27T01:58:09Z
Project community: https://github.com/jayleicn/TVCaption

License: MIT License

TVCaption

PyTorch implementation of MultiModal Transformer (MMT), a method for multimodal (video + subtitle) captioning.

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Jie Lei, Licheng Yu,
Tamara L. Berg, Mohit Bansal

TVC Dataset and Task

We extended TVR by collecting extra captions
for each annotated moment. The resulting dataset, named TV show Captions (TVC),
is a large-scale multimodal video captioning dataset
containing 262K captions paired with 108K moments.
We show annotated captions and model-generated captions below.
As in TVR, the TVC task requires systems to gather information
from both video and subtitle to generate relevant descriptions.
[Figure: TVC example — annotated and model-generated captions]

Method: MultiModal Transformer (MMT)



We designed a MultiModal Transformer (MMT) captioning model that
follows the classical encoder-decoder transformer architecture. It takes both
video and subtitle as encoder inputs and generates captions from the decoder.
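
The idea can be sketched with a toy PyTorch model (a hedged illustration, not the authors' implementation: layer sizes, module names, and the concatenation scheme below are assumptions). Video features and subtitle tokens are projected/embedded into a shared space, concatenated as the encoder input, and a standard transformer decoder produces caption tokens:

```python
import torch
import torch.nn as nn

class TinyMMT(nn.Module):
    """Toy sketch of the MMT idea: encoder input = [video tokens ; subtitle tokens]."""
    def __init__(self, vid_dim=3072, vocab_size=8000, d_model=128):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)      # e.g. ResNet+I3D -> d_model
        self.sub_emb = nn.Embedding(vocab_size, d_model)  # subtitle word ids
        self.cap_emb = nn.Embedding(vocab_size, d_model)  # caption word ids (decoder)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, vid_feats, sub_ids, cap_ids):
        # Concatenate projected video features and embedded subtitle tokens
        # along the sequence dimension to form the multimodal encoder input.
        src = torch.cat([self.vid_proj(vid_feats), self.sub_emb(sub_ids)], dim=1)
        tgt = self.cap_emb(cap_ids)
        return self.out(self.transformer(src, tgt))

model = TinyMMT()
vid = torch.randn(2, 16, 3072)           # (batch, frames, feature_dim)
sub = torch.randint(0, 8000, (2, 20))    # subtitle token ids
cap = torch.randint(0, 8000, (2, 10))    # caption tokens (teacher forcing)
logits = model(vid, sub, cap)
print(logits.shape)  # torch.Size([2, 10, 8000])
```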

Resources

Getting started

Prerequisites

  1. Clone this repository

     git clone --recursive https://github.com/jayleicn/TVCaption.git
     cd TVCaption
  2. Prepare feature files
    Download tvc_feature_release.tar.gz (23GB).
    After downloading the file, extract it to the data directory.

     tar -xf path/to/tvc_feature_release.tar.gz -C data

    You should see video_feature under the data/tvc_feature_release directory.
    It contains video features (ResNet, I3D, ResNet+I3D); these are the same
    video features we used for TVR/XML.
    Read the video feature extraction code to learn how the features are extracted.

  3. Install dependencies:
  • Python 2.7
  • PyTorch 1.1.0
  • nltk
  • easydict
  • tqdm
  • h5py
  • tensorboardX
  4. Add the project root to PYTHONPATH
     source setup.sh
    Note that you need to do this each time you start a new session.
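
The released features appear to be HDF5 files (h5py is a listed dependency). A minimal sketch of inspecting one: the key name and (num_frames, feature_dim) layout below are assumptions, and an in-memory stand-in is used so the snippet runs without the 23GB release:

```python
import h5py
import numpy as np

# Illustrative only: the clip id and array shape are made up; in practice you
# would open a file under data/tvc_feature_release/video_feature/ instead.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("friends_s01e01_seg02_clip_00",
                     data=np.random.rand(16, 2048).astype("float32"))
    for vid_id, dset in f.items():
        feats = dset[:]  # numpy array, shape (num_frames, feature_dim)
        print(vid_id, feats.shape)
```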

Training and Inference

  1. Build Vocabulary
     bash baselines/transformer_captioning/scripts/build_vocab.sh
     Running this command builds the vocabulary file cache/tvc_word2idx.json from the TVC train set.
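
The resulting cache/tvc_word2idx.json is a plain word-to-index mapping. A sketch of using it to encode a caption (the special-token names and example entries here are assumptions, not taken from the actual file):

```python
import json

# Hypothetical stand-in for the real vocabulary; in practice:
# word2idx = json.load(open("cache/tvc_word2idx.json"))
word2idx = {"<pad>": 0, "<unk>": 1, "the": 2, "dog": 3}

def encode(caption, word2idx, unk="<unk>"):
    """Map a caption to token ids; out-of-vocabulary words fall back to <unk>."""
    return [word2idx.get(w, word2idx[unk]) for w in caption.lower().split()]

print(encode("The dog barks", word2idx))  # [2, 3, 1]
```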
  2. MMT training
     bash baselines/multimodal_transformer/scripts/train.sh CTX_MODE VID_FEAT_TYPE
     CTX_MODE refers to the context used (video, sub, or video_sub).
     VID_FEAT_TYPE is the video feature type (resnet, i3d, or resnet_i3d).

Below is an example of training MMT with both video and subtitle, where we use
the concatenation of ResNet and I3D features for video.

    bash baselines/multimodal_transformer/scripts/train.sh video_sub resnet_i3d

By default, this code loads all the data (~30GB) into RAM to speed up training;
pass --no_core_driver to disable this behavior.

Training with the above config stops at around epoch 22, after about 3 hours on a single 2080Ti GPU.
You should get ~45.0 CIDEr-D and ~10.5 BLEU@4 on the val split.
The resulting model and config are saved in a directory named baselines/multimodal_transformer/results/video_sub-res-*

  3. MMT inference
    After training, you can run inference with the saved model on the val or test_public split:
     bash baselines/multimodal_transformer/scripts/translate.sh MODEL_DIR_NAME SPLIT_NAME
    MODEL_DIR_NAME is the name of the directory containing the saved model,
    e.g., video_sub-res-*. SPLIT_NAME can be val or test_public.

Evaluation and Submission

We only release ground truth for the train and val splits. To get results on the test_public split,
please submit your results following the instructions in
standalone_eval/README.md
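
For a quick local sanity check on val predictions, BLEU@4 can be approximated with nltk (already a dependency); note this is not the official metric implementation, which lives in the standalone_eval / coco-caption tooling and also computes CIDEr-D. The tokenization here (lowercase whitespace split) is an assumption:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu4(references, hypotheses):
    """Rough corpus-level BLEU@4 over parallel lists of caption strings."""
    refs = [[r.lower().split()] for r in references]  # one reference per caption
    hyps = [h.lower().split() for h in hypotheses]
    return corpus_bleu(refs, hyps, weights=(0.25,) * 4,
                       smoothing_function=SmoothingFunction().method1)

print(bleu4(["a man opens the door"], ["a man opens the door"]))  # 1.0
```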

Citations

If you find this code useful for your research, please cite our paper:

  @inproceedings{lei2020tvr,
    title={TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval},
    author={Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit},
    booktitle={ECCV},
    year={2020}
  }

Acknowledgement

This research is supported by grants and awards from NSF, DARPA, ARO and Google.

This code borrows components from the following projects:
recurrent-transformer,
OpenNMT-py,
transformers, and
coco-caption.
We thank the authors for open-sourcing these great projects!

Contact

jielei [at] cs.unc.edu