Project author: soujanyaporia

Project description:
Multimodal Sarcasm Detection Dataset
Primary language: OpenEdge ABL
Repository: git://github.com/soujanyaporia/MUStARD.git
Created: 2019-02-20T02:08:20Z
Project page: https://github.com/soujanyaporia/MUStARD

License: MIT License


MUStARD: Multimodal Sarcasm Detection Dataset

Open in Colab

This repository contains the dataset and code for our ACL 2019 paper:

Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)

We release the MUStARD dataset, a multimodal video corpus for research in automated sarcasm discovery. The dataset
is compiled from popular TV shows including Friends, The Golden Girls, The Big Bang Theory, and
Sarcasmaholics Anonymous. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is
accompanied by its context, providing additional information on the scenario where it occurs.

Example Instance

Example sarcastic utterance from the dataset along with its context and transcript.

Raw Videos

We provide the raw video clips, including both the utterances and their respective context.

Data Format

The annotations and transcripts of the audiovisual clips are available at data/sarcasm_data.json.
Each instance in the JSON file is keyed by an identifier (e.g., "1_60") and maps to a dictionary with the following keys:

Key               Value
utterance         The text of the target utterance to classify.
speaker           Speaker of the target utterance.
context           List of utterances (in chronological order) preceding the target utterance.
context_speakers  Respective speakers of the context utterances.
sarcasm           Binary sarcasm label.

Example format in JSON:

  {
    "1_60": {
      "utterance": "It's just a privilege to watch your mind at work.",
      "speaker": "SHELDON",
      "context": [
        "I never would have identified the fingerprints of string theory in the aftermath of the Big Bang.",
        "My apologies. What's your plan?"
      ],
      "context_speakers": [
        "LEONARD",
        "SHELDON"
      ],
      "sarcasm": true
    }
  }
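As a sketch of how these annotations might be consumed, the snippet below parses the example instance above with Python's standard `json` module; it assumes only the schema documented here, not any helper code from the repository.

```python
import json

# The example instance from above, embedded as a string; in practice you
# would load the full file with: data = json.load(open("data/sarcasm_data.json"))
sample = """
{
  "1_60": {
    "utterance": "It's just a privilege to watch your mind at work.",
    "speaker": "SHELDON",
    "context": [
      "I never would have identified the fingerprints of string theory in the aftermath of the Big Bang.",
      "My apologies. What's your plan?"
    ],
    "context_speakers": ["LEONARD", "SHELDON"],
    "sarcasm": true
  }
}
"""

data = json.loads(sample)
instance = data["1_60"]

# Context speakers and context utterances are parallel lists.
pairs = list(zip(instance["context_speakers"], instance["context"]))

for speaker, line in pairs:
    print(f"{speaker}: {line}")
print(f"{instance['speaker']}: {instance['utterance']}  (sarcastic: {instance['sarcasm']})")
```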

Citation

Please cite the following paper if you find this dataset useful in your research:

  @inproceedings{mustard,
      title = "Towards Multimodal Sarcasm Detection (An \_Obviously\_ Perfect Paper)",
      author = "Castro, Santiago and
        Hazarika, Devamanyu and
        P{\'e}rez-Rosas, Ver{\'o}nica and
        Zimmermann, Roger and
        Mihalcea, Rada and
        Poria, Soujanya",
      booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
      month = "7",
      year = "2019",
      address = "Florence, Italy",
      publisher = "Association for Computational Linguistics",
  }

Run the code

  1. Set up the environment with Conda:

    conda env create
    conda activate mustard
    python -c "import nltk; nltk.download('punkt')"
  2. Download the Common Crawl pretrained GloVe word vectors (300d, 840B tokens) and save them somewhere accessible.

  3. Download the pre-extracted visual features to the data/ folder (so data/features/ contains the folders context_final/ and utterances_final/ with the features) or extract the visual features yourself.

  4. Download the pre-extracted BERT features and place the two files directly under the folder data/ (so they are data/bert-output.jsonl and data/bert-output-context.jsonl), or extract the BERT features in another environment with Python 2 and TensorFlow 1.11.0 following
    “Using BERT to extract fixed feature vectors (like ELMo)” from BERT’s repo
    and running:

    # Download BERT-base uncased in some dir:
    wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
    # Then put the location in this var:
    BERT_BASE_DIR=...
    python extract_features.py \
      --input_file=data/bert-input.txt \
      --output_file=data/bert-output.jsonl \
      --vocab_file=${BERT_BASE_DIR}/vocab.txt \
      --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
      --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
      --layers=-1,-2,-3,-4 \
      --max_seq_length=128 \
      --batch_size=8
  5. Check the options in python train_svm.py -h to select a run configuration (or modify config.py) and then run it:

    python train_svm.py  # Add the flags you want.
  6. Evaluation: We evaluate using a weighted F-score metric in a 5-fold cross-validation scheme. The fold indices are available at data/split_indices.p. Refer to our baseline scripts for more details.
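The evaluation protocol above (weighted F-score averaged over 5 folds) can be sketched with scikit-learn. The features, labels, and SVM settings below are placeholders for illustration only; the repository's actual configuration lives in train_svm.py and config.py, and the real fold indices come from data/split_indices.p rather than being generated here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder binary sarcasm labels

# 5-fold cross-validation; MUStARD ships fixed fold indices so that
# results are comparable across papers, but the metric is the same.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    # Weighted F-score: per-class F1 averaged, weighted by class support.
    scores.append(f1_score(y[test_idx], pred, average="weighted"))

print(f"mean weighted F1 over 5 folds: {np.mean(scores):.3f}")
```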