Abstractive text summarization based on deep learning and semantic content generalization
This source code has been used in the experimental procedure of the following paper:
Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis. 2019. Abstractive text summarization based on deep learning and semantic content generalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5082-5092.
This paper is accessible in the Proceedings of the 57th ACL Annual Meeting (2019) or directly from the ACL Anthology (https://www.aclweb.org/anthology/P19-1501).
For citing, the BibTeX entry follows:
@inproceedings{kouris2019abstractive,
  title     = {Abstractive text summarization based on deep learning and semantic content generalization},
  author    = {Kouris, Panagiotis and Alexandridis, Georgios and Stafylopatis, Andreas},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month     = jul,
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/P19-1501},
  pages     = {5082--5092},
}
The code described below follows the methodology and assumptions that are described in detail in the aforementioned paper.
The experimental procedure, as described in the paper, requires the Gigaword dataset as the initial dataset for training, validation and testing, as described by Rush et al. 2015 (see the references in the paper). Additionally, the DUC 2004 dataset is used for testing, as also described in the paper.
According to the paper, the initial dataset is further preprocessed and generalized according to one of the proposed text generalization strategies (e.g. NEG100 or LG200d5). The generalized dataset is then used for training, where the deep learning model learns to predict a generalized summary.
In the testing phase, a generalized article (e.g. an article of the test set) is given as input to the deep learning model, which predicts the respective generalized summary. Then, in the post-processing phase, the generalized concepts of the generalized summary are replaced by the specific concepts of the original (preprocessed) article, producing the final summary.
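As a brief illustration of this pipeline under the NEG strategy (the sentences and tags below are invented for the example and do not come from the dataset):

original (preprocessed) article: chancellor angela merkel visited paris on monday
generalized article (NEG): chancellor person visited location on monday
predicted generalized summary: person visited location
final summary (after post-processing): angela merkel visited paris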
The workflow of this framework follows:
Text generalization
Both text generalization tasks, NEG and LG, are performed by the DataPreprocessing class (file preprocessing.py).
Firstly, part-of-speech tagging is required; it is performed by the pos_tagging_of_dataset_and_vocabulary_of_words_pos_frequent() method for the Gigaword dataset and the pos_tagging_of_duc_dataset_and_vocab_pos_frequent() method for the DUC dataset. Then the NEG or LG strategy can be applied as follows:
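For orientation, a minimal sketch of this kind of POS tagging with NLTK follows; the repository's own methods may rely on a different tagger and data format:

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("the stock market closed higher on monday")
print(nltk.pos_tag(tokens))
# e.g. [('the', 'DT'), ('stock', 'NN'), ('market', 'NN'), ('closed', 'VBD'), ...]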
NEG Strategy
The annotation of named entities is performed by the ner_of_dataset_and_vocabulary_of_ner_words() method for the Gigaword dataset and the ner_of_duc_dataset_and_vocab_of_ne() method for the DUC dataset. Then the conver_dataset_with_ner_from_stanford_and_wordnet() method for the Gigaword dataset and the conver_duc_dataset_with_ner_from_stanford_and_wordnet() method for the DUC dataset generalize these datasets according to the NEG strategy, provided that their parameters have been set accordingly.
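The core idea can be sketched as follows, using NLTK's interface to the Stanford NER tagger; the model and jar paths are placeholders, and the actual methods above implement additional logic (e.g. the WordNet-based part of the generalization):

from nltk.tag import StanfordNERTagger

# Placeholder paths: point them to a local Stanford NER installation.
tagger = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")

tokens = "france beat brazil in paris".split()
tagged = tagger.tag(tokens)  # e.g. [('france', 'LOCATION'), ('beat', 'O'), ...]

# Replace every named entity with its (lower-cased) entity class.
generalized = [tag.lower() if tag != "O" else tok for tok, tag in tagged]
print(" ".join(generalized))  # e.g. "location beat location in location"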
LG Strategy
The word_freq_hypernym_paths() method produces a file that contains a vocabulary with the frequency and the hypernym path of each word. This file is then used by the vocab_based_on_hypernyms() method in order to produce a file containing a vocabulary of the words that are candidates for generalization. Finally, for the Gigaword dataset, the convert_dataset_to_general() method produces the files with the summary-article pairs which constitute the generalized dataset, while for the DUC dataset the convert_duc_dataset_based_on_level_of_generalizetion() method is used. The hyperparameters of these methods should be set accordingly.
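As a hedged illustration of the hypernym paths involved, using NLTK's WordNet interface (in a strategy name such as LG100d5, the frequency threshold and the depth determine which words are generalized and to which level):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def hypernym_path(word):
    # Hypernym path of the first (most common) noun sense, from the WordNet root.
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return None
    return [s.lemma_names()[0] for s in synsets[0].hypernym_paths()[0]]

print(hypernym_path("poodle"))
# e.g. ['entity', 'physical_entity', 'object', ..., 'animal', ..., 'dog', 'poodle']

An infrequent word could then be replaced by the element of its path at the configured level of generalization.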
Building dataset
Having generalized the datasets, the files of the training, validation and test sets for a given generalization strategy are built as follows:
python build_dataset.py -mode train -model lg100d5g
python build_dataset.py -mode validation -model lg100d5g
python build_dataset.py -mode test -model lg100d5g
Training
The process of training is performed by the Train class (file train_v2.py), having set the hyperparameters accordingly. The files produced in the previous step of Building dataset are used as input in this phase of training.
The process of training is performed by the command:
python train.py -model neg100
where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100, etc.).
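For orientation, a minimal sketch of the kind of sequence-to-sequence (encoder-decoder) network that learns to map generalized articles to generalized summaries follows. It is written with Keras purely for illustration; the architecture, vocabulary size and dimensions are illustrative and are not those of train_v2.py:

import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, emb_dim, hidden = 50000, 300, 512  # illustrative values

# Encoder: embeds the generalized article and encodes it with an LSTM.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_size, emb_dim)(enc_in)
_, state_h, state_c = layers.LSTM(hidden, return_state=True)(enc_emb)

# Decoder: generates the generalized summary token by token.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_size, emb_dim)(dec_in)
dec_seq = layers.LSTM(hidden, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(vocab_size, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")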
Post-processing of generalized summaries
In the testing phase, the task of post-processing the generalized summaries, which are produced by the deep learning model, is required in order to replace the generalized concepts of a generalized summary with the specific ones from the corresponding original article. This task is performed by the PostProcessing class, by setting the parameters of its __init__() method accordingly. More specifically, the mode should be set to "lg" or "neg" according to the employed text generalization strategy. Also, the hyperparameters of the _neg_postprocessing() and lg_postprocessing() methods for the file paths, the text similarity function and the context window should be set accordingly.
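As a hedged sketch of the underlying idea (not the PostProcessing class itself): each generalized token of the predicted summary is replaced by the candidate word of the original article whose surrounding context window is most similar to that of the generalized token. Here, embed stands for a hypothetical word-embedding lookup returning a numpy vector or None:

import numpy as np

def context_vector(tokens, i, window, embed):
    # Average the embeddings of the words around position i.
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    vecs = [embed(w) for w in ctx if embed(w) is not None]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def resolve(summary, i, candidates, article, embed, window=2):
    # candidates: (word, position-in-article) pairs matching the generalized concept.
    target = context_vector(summary, i, window, embed)
    best, best_sim = summary[i], -1.0
    for word, j in candidates:
        cand = context_vector(article, j, window, embed)
        if target is not None and cand is not None:
            sim = cosine(target, cand)
            if sim > best_sim:
                best, best_sim = word, sim
    return best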
The testing (prediction and post-processing) is performed by the following commands, where the argument -mode specifies the employed test set:
python testing.py -mode gigaword
python testing.py -mode duc
python testing.py -mode duc75b
Setting parameters and paths
The values of hyperparameters should be specified in the file parameters.py, while the paths of the corresponding files should be set in the file paths.py.
Additionally, a file with word embeddings (e.g. word2vec) is required; its file path and the dimension of the vectors (e.g. 300) should be specified in the files paths.py and parameters.py, respectively.
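For example, pretrained word2vec vectors could be loaded with gensim as follows; the file name is a placeholder for the path configured in paths.py:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(vectors["summary"].shape)  # (300,) should match the dimension set in parameters.py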
The project was developed in Python 3.5 and the required Python packages are listed in the file requirements.txt.
The code described above includes the functionality that was used in the experimental procedure of the corresponding paper. However, the proposed framework is not limited to the current implementation: it is based on a well-defined theoretical model, so its performance may be enhanced by extending or improving this implementation (e.g. using a better taxonomy of concepts, a different machine learning model, or an alternative similarity method for the post-processing task).