Semantic Answer Type Prediction
IR-SMART contains the code developed for a university project, located here.
Given a query formulated in natural language, the code predicts the expected answer type from a set of candidate entities in a target ontology. In this project the target ontology is taken from the DBpedia 2016 dump.
The project makes extensive use of the following tools and libraries:
- Elasticsearch
- Gensim
- NumPy / SciPy
- scikit-learn
- GloVe word embeddings
- Jupyter Notebook
To get a local copy up and running, follow these simple steps. It is assumed that Jupyter Notebook is available, and a Conda distribution (Anaconda/Miniconda) is recommended.
Install the necessary Python libraries (if Conda is not used):
```sh
pip install --upgrade elasticsearch gensim numpy scipy scikit-learn
```
Other dependencies may exist, but they are provided by the Conda distribution.
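The notebooks also rely on a running Elasticsearch instance. A quick way to check that it is reachable before indexing is sketched below; the host and port are assumptions about a default local setup, not something prescribed by the project.

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch node on the default port; adjust if your setup differs.
es = Elasticsearch("http://localhost:9200")

if es.ping():
    print("Connected to Elasticsearch", es.info()["version"]["number"])
else:
    print("Elasticsearch is not reachable - start it before running the indexer.")
```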
Due to their overall size, the following files have to be downloaded separately: the DBpedia 2016 dump files (instance_types_en.ttl, long_abstracts_en.ttl), the SMART task question files, and the GloVe 6B embeddings.
Once all the files have been downloaded, extract them and place them so that the directory structure looks as follows (the files marked with ## are the ones you need to download and place yourself):
📦IR-SMART
┣ 📂datasets
┃ ┣ 📂DBpedia
┃ ┃ ┣ 📜instance_types_en.ttl ##
┃ ┃ ┣ 📜long_abstracts_en.ttl ##
┃ ┃ ┣ 📜smarttask_dbpedia_test_questions.json ##
┃ ┃ ┗ 📜smarttask_dbpedia_train.json ##
┃ ┣ 📂gensim
┃ ┃ ┗ 📜...
┃ ┗ 📂glove
┃ ┣ 📜glove.6B.100d.txt ##
┃ ┣ 📜glove.6B.200d.txt ##
┃ ┣ 📜glove.6B.300d.txt ##
┃ ┗ 📜glove.6B.50d.txt ##
┣ 📂results
┃ ┣ 📜advanced.csv
┃ ┣ 📜advanced_word2vec.csv
┃ ┣ 📜baseline.csv
┃ ┗ 📜test_type_predictions.csv
┣ 📜.gitignore
┣ 📜baseline_variable_test.ipynb
┣ 📜evaluation.ipynb
┣ 📜indexer.ipynb
┣ 📜indexer_compact.ipynb
┣ 📜LICENSE
┣ 📜README.md
┗ 📜trial_and_error.ipynb
The necessary code to execute is located in `indexer_compact.ipynb` and `evaluation.ipynb`. The other notebooks contain an alternative, larger index (`indexer.ipynb`) and tests of how varying parameter values affected the score (`baseline_variable_test.ipynb`). `trial_and_error.ipynb` contains a failed early attempt to make the ES indexing more efficient by first loading all datafiles into memory and then initializing the ES indexing (not recommended to run).
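For reference, the indexing performed by these notebooks is conceptually along the lines of the sketch below. The index name, field layout, and the `read_dbpedia_records()` helper are illustrative assumptions rather than the notebooks' actual code; the point is that records are streamed into Elasticsearch in batches instead of loading every datafile into memory first.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")   # assumed local, default instance
INDEX_NAME = "dbpedia_entities"               # illustrative name

def read_dbpedia_records():
    """Placeholder: yield (entity_uri, abstract, [types]) tuples.
    The real notebooks parse instance_types_en.ttl and long_abstracts_en.ttl."""
    yield ("http://dbpedia.org/resource/Example", "An example abstract.", ["dbo:Thing"])

def create_the_index():
    """Create a fresh index with a simple mapping (illustrative, not the real one)."""
    if es.indices.exists(index=INDEX_NAME):
        es.indices.delete(index=INDEX_NAME)
    es.indices.create(
        index=INDEX_NAME,
        mappings={
            "properties": {
                "abstract": {"type": "text"},   # text from long_abstracts_en.ttl
                "types": {"type": "keyword"},   # types from instance_types_en.ttl
            }
        },
    )

def index_data(batch_size=10000):
    """Stream entity records into the index in batches."""
    batch = []
    for entity, abstract, types in read_dbpedia_records():
        batch.append({
            "_index": INDEX_NAME,
            "_id": entity,
            "_source": {"abstract": abstract, "types": types},
        })
        if len(batch) >= batch_size:
            bulk(es, batch)
            batch = []
    if batch:
        bulk(es, batch)
```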
Execute all cells within `indexer_compact.ipynb`; this will generate the Elasticsearch index necessary for all subsequent steps. Use `createTheIndex()` in cell 5 to generate the index, and `indexData(10000)` near the bottom of the file to populate it.

Execute all cells within `evaluation.ipynb`; this will perform the evaluation using both the baseline and the advanced implementation. The `convertGlovetoGensim()` function call in cell 5 is necessary to allow Gensim to parse the GloVe embedding file (a minimal sketch of such a conversion is shown below).
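How `convertGlovetoGensim()` is implemented is not shown here, but with Gensim 4.x a GloVe text file can be loaded directly as word2vec-style vectors; the sketch below illustrates the idea (the file path is taken from the directory tree above, everything else is an assumption). Older Gensim versions instead need `gensim.scripts.glove2word2vec` to write a converted copy first, presumably into the `datasets/gensim` folder.

```python
from gensim.models import KeyedVectors

# GloVe text files have no header line; with Gensim 4.x, no_header=True lets
# load_word2vec_format parse them directly as word2vec-style text vectors.
glove_path = "datasets/glove/glove.6B.100d.txt"
vectors = KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)

print(vectors.most_similar("university", topn=3))
```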
The achieved accuracy scores are summarized in the table below:

Method | Accuracy | NDCG@5 | NDCG@10 |
---|---|---|---|
Strict Baseline | 0.492 | 0.237 | 0.323 |
Lenient Baseline | 0.492 | 0.312 | 0.414 |
Strict Word2Vec | 0.522 | 0.280 | 0.367 |
Lenient Word2Vec | 0.522 | 0.364 | 0.455 |
Strict LTR (pointwise) | 0.776 | 0.731 | 0.754 |
Lenient LTR (pointwise) | 0.776 | 0.753 | 0.780 |
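For context, the NDCG@k columns follow the standard NDCG definition; the strict and lenient rows differ roughly in how per-type gains are assigned by the SMART-task evaluation, which the generic sketch below does not reproduce (the gain values are illustrative only).

```python
import numpy as np

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the top-k ranked type predictions."""
    gains = np.asarray(gains[:k], dtype=float)
    discounts = np.log2(np.arange(2, gains.size + 2))  # log2(rank + 1)
    return float(np.sum(gains / discounts))

def ndcg_at_k(gains, k):
    """DCG divided by the DCG of an ideal (descending-gain) ordering.
    For simplicity the ideal is computed from the same gain list; the real
    evaluation uses the gains of all gold types."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Example: binary gains for five ranked type predictions
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))
```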
Distributed under the GPL-3.0 License. See `LICENSE` for more information.