KB_QA (Knowledge Base Question Answering)

Introduction

This is knowledge base QA task with the data from http://tcci.ccf.org.cn/conference/2017/taskdata.php (task 5: Open Domain Question Answering).
Chinese version introduction: https://blog.csdn.net/m0_37531129/article/details/103321814

My development environment is ubuntu 14.04 with pytorch 1.2.0.
Please make sure ‘mysql’ was installed in your desktop/PC. There are many guidances for mysql installing in Ubuntu.

You can clone or download this ‘KB_QA’ repository.

please ‘cd preProcessData’

1. run splitTest.py
  
  There are train and test dataset in NLPCC2017 task5. We can spilt test data by 1:1 to get test and dev data.
1. run preCleanData.py
  
  There are three functions in this script: getNERData, getDBData and getSimilarityData.
  You will get three folders named NERData, DBData, SIMData.
1. run uploadDB.py
  
  (pls create a KB_QA database in mysql).
  Running the script, it will upload data in DBData folder to KB_QA database.

1. Pls download Bert Pretraining model(pytorch version) adding them to subModel/BertPreTrainedModel.
  and pls make sure NERData folder existed under preProcessData/NERData, there should be three text files(train.txt, dev.txt, test.txt)
1. training NER model
  
  run NERMain.py —data_dir preProcessData/NERData —vocab_file BertPreTrainedModel/vocab.txt —model_config BertPreTrainedModel/conig.json —output_dir output_model —pre_train_model BertPreTrainedModel/pytorch_model.bin —max_seq_length 64 —do_train —train_batch_size 8 —eval_batch_szie 8 —gradient_accumulation_steps 16 —num_train_epochs 8

1. Pls make sure Bert Pretraining model(pytorch version) under subModel/BertPreTrainedModel existed.
1. training classification model
  
  run SIMMain.py —data_dir preProcessData/SIMData —vocab_file BertPreTrainedModel/vocab.txt —model_config BertPreTrainedModel/config.json —output_dir output_model —pre_train_model BertPreTrainedModel/pytorch_model.bin —max_seq_length 64 —do_train —train_epoch_size 8 —eval_batch_size 8 —gradient_accumulation_steps 16 -num_train_epochs 8

python RunTask.py