Project author: pbteja1998

Project description:
Scientific Paper Summarization
Language: Python
Project URL: git://github.com/pbteja1998/ire_project_18.git
Created: 2018-11-12T15:48:21Z
Project community: https://github.com/pbteja1998/ire_project_18


Summarization of Scientific Texts: A Rhetorical Approach

  • This project is based on the paper "Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status".

  • The main idea of that paper is to classify the statements in a scientific paper by their rhetorical role, on the basis of argumentative zoning.

  • This project builds towards automatic summarisation of scientific papers. We aim to classify each sentence of a research paper into one of the seven rhetorical categories listed below.

Annotation Based on Argumentative Zoning

Each sentence in the paper is assigned to one of the following categories:

  • Aim - Specific research goal of the current paper
  • Textual - Makes reference to the structure of the current paper
  • Own - (Neutral) description of own work presented in current paper: Methodology, results, discussion
  • Background - Generally accepted scientific background
  • Contrast - Statements of comparison with or contrast to other work; weaknesses of other work
  • Basis - Statements of agreement with other work or continuation of other work
  • Other - (Neutral) description of other researchers’ work

On the basis of these rhetorical categories, we perform argumentative zoning of the sentences in each paper.
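As an illustration, the seven categories can be represented as a small Python enumeration so that labels are handled consistently in later steps. This is only a sketch: the names `Zone`, `zone_from_label` and `group_by_zone` are hypothetical and are not taken from the project's code.

```python
from enum import Enum


class Zone(Enum):
    """The seven argumentative-zoning categories listed above."""
    AIM = "Specific research goal of the current paper"
    TEXTUAL = "Reference to the structure of the current paper"
    OWN = "Neutral description of own work"
    BACKGROUND = "Generally accepted scientific background"
    CONTRAST = "Comparison with or contrast to other work"
    BASIS = "Agreement with or continuation of other work"
    OTHER = "Neutral description of other researchers' work"


def zone_from_label(label):
    """Map a label string such as "aim" or " Background " to a Zone (case-insensitive)."""
    return Zone[label.strip().upper()]


def group_by_zone(labelled):
    """Group (sentence, label) pairs into a dict keyed by Zone."""
    zones = {zone: [] for zone in Zone}
    for sentence, label in labelled:
        zones[zone_from_label(label)].append(sentence)
    return zones
```

Grouping sentences by zone in this way is one plausible first step towards a summary, since zones such as Aim can then be selected directly.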

Features of a sentence

  • Location - Where in the document the sentence occurs.
  • Section Structure - Where in the section the sentence occurs, e.g. whether it is the first sentence of the section.
  • Paragraph Structure - Whether the sentence occurs at the start, in the middle or at the end of a paragraph.
  • Headline
  • Length - Whether the sentence is long or not.
  • Title - Whether words in the sentence also occur in the title.
  • TF-IDF Score - Whether the sentence contains significant words or not.
  • Voice - Voice of the main verb of the sentence.
  • Tense - Tense of the main verb or auxiliary verb of the sentence.
  • Modal - Whether the main verb is accompanied by an auxiliary verb, determined in the same way as tense; if so, the corresponding value is recorded.
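A few of the positional and lexical features above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the regex tokenizer (a stand-in for a real tokenizer such as NLTK's), the document representation and the `long_threshold` cut-off are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def position_bucket(index, total):
    """Return 'start', 'middle' or 'end' for item `index` out of `total` items."""
    if index == 0:
        return "start"
    if index == total - 1:
        return "end"
    return "middle"


def sentence_features(paragraphs, title, par_idx, sent_idx, long_threshold=15):
    """Build a feature dict for one sentence.

    `paragraphs` is a list of paragraphs, each a list of sentence strings.
    `long_threshold` (in tokens) is an arbitrary illustrative cut-off.
    """
    sentence = paragraphs[par_idx][sent_idx]
    tokens = tokenize(sentence)
    # Position of the sentence in the whole document, as a fraction in [0, 1].
    n_before = sum(len(p) for p in paragraphs[:par_idx]) + sent_idx
    n_total = sum(len(p) for p in paragraphs)
    return {
        "location": n_before / max(n_total - 1, 1),
        "paragraph_position": position_bucket(sent_idx, len(paragraphs[par_idx])),
        "is_long": len(tokens) > long_threshold,
        # Stop words are not filtered here, so a shared "of" would also count.
        "has_title_word": bool(set(tokens) & set(tokenize(title))),
    }


doc = [
    ["We propose a method for summarization.", "It uses rhetorical zones.", "Details follow."],
    ["Prior work differs."],
]
first = sentence_features(doc, "Summarization of Scientific Texts", 0, 0)
last = sentence_features(doc, "Summarization of Scientific Texts", 1, 0)
```

The remaining features (voice, tense, modal) depend on part-of-speech information and are left out of this sketch.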

Approaches and Tools Used

We used an existing argumentative zoning dataset, built a feature vector for each sentence, and trained a Naive Bayes classifier on it, using an 80/20 train/test split.

We used NLTK and scikit-learn to write the classifier. Because scikit-learn provides several Naive Bayes variants, we were able to test our model with multiple distributions.

We used Naive Bayes with the following distributions:

  • Bernoulli Distribution
  • Gaussian Distribution
  • Multinomial Distribution
  • Complement Distribution
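The comparison described above can be sketched with scikit-learn's four Naive Bayes estimators. This is an illustrative sketch only: the function name `compare_naive_bayes` and the random stand-in data are assumptions, and the rows are split per sentence here, whereas the project reports its split in numbers of papers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB


def compare_naive_bayes(X, y, seed=0):
    """Fit four Naive Bayes variants on an 80/20 split; return test accuracy per variant."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "Bernoulli": BernoulliNB(),
        "Gaussian": GaussianNB(),
        "Multinomial": MultinomialNB(),
        "Complement": ComplementNB(),
    }
    # score() returns the fraction of correctly classified test rows.
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


# Synthetic stand-in data: 200 "sentences", 10 non-negative integer features, 7 classes.
# MultinomialNB and ComplementNB require non-negative feature values.
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(200, 10))
y = rng.randint(0, 7, size=200)
scores = compare_naive_bayes(X, y)
```

Because the stand-in labels are random, the scores from this sketch say nothing about the project's results; only the structure of the comparison is meant to carry over.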

Results

Type            Number of papers
Train Dataset   64 (80%)
Test Dataset    15 (20%)

Distribution    Accuracy (%)
Bernoulli       84.64
Gaussian        100
Multinomial     80.89
Complement      81.28
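For reference, accuracy and a confusion matrix can be computed from gold and predicted labels as in the sketch below. It assumes that accuracy means the percentage of test sentences classified correctly; the function names and the toy labels are illustrative and are not the project's code or data.

```python
from collections import Counter


def confusion_matrix(gold, predicted, labels):
    """Return a nested dict: matrix[gold_label][predicted_label] -> count."""
    counts = Counter(zip(gold, predicted))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}


def accuracy(gold, predicted):
    """Percentage of items whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Toy labels for illustration only; these are not the project's data.
gold = ["Own", "Own", "Aim", "Other", "Own"]
pred = ["Own", "Aim", "Aim", "Other", "Own"]
matrix = confusion_matrix(gold, pred, ["Aim", "Other", "Own"])
acc = accuracy(gold, pred)
```

Each row of the matrix corresponds to a gold category and each column to a predicted one, so off-diagonal counts show which categories are confused with each other.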

Plots:

  • Bernoulli Distribution: Confusion Matrix, Histogram
  • Complement Distribution: Confusion Matrix, Histogram
  • Gaussian Distribution: Confusion Matrix, Histogram
  • Multinomial Distribution: Confusion Matrix, Histogram


Running the code

To generate a summary of a given file:

    $ python src/summary.py {relative_path_of_file_from_summary.py}

Example:

    $ python src/summary.py ../data/tagged/9405001.az-scixml

To train, test and get the accuracy of the sentence classification:

    $ python src/naive_bayes.py

To run the Flask app locally:

    $ sudo apt-get install python-pip
    $ sudo pip install virtualenv
    $ virtualenv -p python venv
    $ source venv/bin/activate
    $ pip install -r requirements.txt
    $ export PORT=5000
    $ gunicorn -b :$PORT --chdir src app:app

After running the above commands, go to the following URL:

    http://0.0.0.0:5000/
