# Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension
This repository maintains C3, the first free-form multiple-Choice Chinese machine reading Comprehension dataset.
If you use C3 in your work, please cite the following paper:

```bibtex
@article{sun2019investigating,
  title={Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension},
  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
  url={https://arxiv.org/abs/1904.09679v3}
}
```
Files in this repository:
* `license.txt`: the license of C3.
* `data/c3-{m,d}-{train,dev,test}.json`: the dataset files, where `m` and `d` represent "mixed-genre" and "dialogue", respectively. The data format is as follows.
  ```
  [
    [
      [
        document 1
      ],
      [
        {
          "question": document 1 / question 1,
          "choice": [
            document 1 / question 1 / answer option 1,
            document 1 / question 1 / answer option 2,
            ...
          ],
          "answer": document 1 / question 1 / correct answer option
        },
        {
          "question": document 1 / question 2,
          "choice": [
            document 1 / question 2 / answer option 1,
            document 1 / question 2 / answer option 2,
            ...
          ],
          "answer": document 1 / question 2 / correct answer option
        },
        ...
      ],
      document 1 / id
    ],
    [
      [
        document 2
      ],
      [
        {
          "question": document 2 / question 1,
          "choice": [
            document 2 / question 1 / answer option 1,
            document 2 / question 1 / answer option 2,
            ...
          ],
          "answer": document 2 / question 1 / correct answer option
        },
        {
          "question": document 2 / question 2,
          "choice": [
            document 2 / question 2 / answer option 1,
            document 2 / question 2 / answer option 2,
            ...
          ],
          "answer": document 2 / question 2 / correct answer option
        },
        ...
      ],
      document 2 / id
    ],
    ...
  ]
  ```
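  For illustration, here is a minimal Python sketch that loads one split and walks this structure (the file path below is just an example; the other dataset files share the same format):

  ```python
  import json

  # Load the dialogue training split (path relative to the repository root).
  with open("data/c3-d-train.json", encoding="utf-8") as f:
      data = json.load(f)

  # Each entry is [document segments, question list, document id].
  for passages, questions, doc_id in data:
      document = "\n".join(passages)  # a document is a list of text segments
      for q in questions:
          print(doc_id, q["question"])
          for option in q["choice"]:
              mark = "*" if option == q["answer"] else " "
              print(f"  [{mark}] {option}")
  ```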
* `annotation/c3-{m,d}-{dev,test}.txt`: question type annotations. Each file contains 150 annotated instances. We adopt the following abbreviations:

  | Category | Abbreviation | Question Type |
  |---|---|---|
  | Matching | m | Matching |
  | Prior knowledge | l | Linguistic |
  | | s | Domain-specific |
  | | c-a | Arithmetic |
  | | c-o | Connotation |
  | | c-e | Cause-effect |
  | | c-i | Implication |
  | | c-p | Part-whole |
  | | c-d | Precondition |
  | | c-h | Scenario |
  | | c-n | Other |
  | Supporting Sentences | 0 | Single Sentence |
  | | 1 | Multiple sentences |
  | | 2 | Independent |
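  If you process the annotation files programmatically, the abbreviations can be encoded as a small lookup table. A sketch transcribed from the table above (the variable names are ours, not part of the repository):

  ```python
  # Question-type abbreviations, transcribed from the table above.
  QUESTION_TYPES = {
      "m": "Matching",
      "l": "Linguistic (prior knowledge)",
      "s": "Domain-specific (prior knowledge)",
      "c-a": "Arithmetic",
      "c-o": "Connotation",
      "c-e": "Cause-effect",
      "c-i": "Implication",
      "c-p": "Part-whole",
      "c-d": "Precondition",
      "c-h": "Scenario",
      "c-n": "Other",
  }

  # Supporting-sentence labels.
  SUPPORT = {"0": "Single sentence", "1": "Multiple sentences", "2": "Independent"}
  ```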
* `bert` folder: code of the Chinese BERT, BERT-wwm, and BERT-wwm-ext baselines. The code is derived from this repository. Below are detailed instructions on fine-tuning Chinese BERT on C3.
  1. Set the environment variable pointing to the directory of the pre-trained Chinese BERT: `export BERT_BASE_DIR=/PATH/TO/BERT/DIR`.
  2. Copy the dataset folder `data` to `bert/`.
  3. In `bert`, execute `python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin` to convert the TensorFlow checkpoint to a PyTorch model.
  4. Execute `python run_classifier.py --task_name c3 --do_train --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 2e-5 --num_train_epochs 8.0 --output_dir c3_finetuned --gradient_accumulation_steps 3`.
  5. The fine-tuned model and evaluation results are stored in `bert/c3_finetuned`.
  Note:
  1. There is randomness in training; you may want to try different random seeds (specify `--seed` when executing `run_classifier.py`) and select the best model based on development set performance.
  2. Depending on your hardware, you may need to adjust `gradient_accumulation_steps`.
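  As background on the last note: gradient accumulation trades per-pass batch size for memory. With `--train_batch_size 24` and `--gradient_accumulation_steps 3`, gradients are typically accumulated over three forward/backward passes of 8 examples each before a single optimizer update, so the effective batch size stays at 24. A generic PyTorch sketch of the pattern (illustrative only; the `model(inputs, labels)` interface is assumed, and this is not the exact loop in `run_classifier.py`):

  ```python
  def train_epoch(model, loader, optimizer, accumulation_steps=3):
      """One epoch with gradient accumulation: a single optimizer
      update every `accumulation_steps` mini-batches."""
      model.train()
      optimizer.zero_grad()
      for step, (inputs, labels) in enumerate(loader, start=1):
          loss = model(inputs, labels)  # assumed: the model returns the loss
          # Scale so the accumulated gradient equals the average over
          # the full effective batch (accumulation_steps mini-batches).
          (loss / accumulation_steps).backward()
          if step % accumulation_steps == 0:
              optimizer.step()
              optimizer.zero_grad()
  ```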