Citation Classification using a hybrid neural network model for Wikipedia References
The documentation is available as a wiki: DOCUMENTATION
A dataset of citations is extracted from English Wikipedia (dump date: May 2020), covering 35 different citation templates such as `cite news` and `cite web`.
The dataset contains 29.276 million citations. From it, a subset of 3.92 million citations with identifiers is prepared; this subset covers only the DOI, ISBN, PMC, PMID, and arXiv identifiers.
Along with the two citation datasets, two frameworks are provided to train on the citations and classify whether a citation is scientific or not. Anyone is welcome to build models or run experiments using the extracted datasets and improve our results!
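To give a flavour of the classification task, here is a purely illustrative Keras sketch of a "hybrid" binary classifier that combines a text branch with a feature branch. The layer sizes, vocabulary size, and feature count are assumptions and do not reflect the actual models in this repository (see the notebooks for those).

```python
# Illustrative sketch only: a hybrid model with one branch over tokenized
# citation text and one branch over numeric citation features. All sizes
# below are assumptions, not the project's actual architecture.
from tensorflow.keras import layers, Model

text_in = layers.Input(shape=(100,), name="citation_tokens")   # tokenized citation text (assumed length 100)
feat_in = layers.Input(shape=(10,), name="citation_features")  # numeric metadata features (assumed 10)

x = layers.Embedding(input_dim=20000, output_dim=64)(text_in)  # assumed vocabulary size
x = layers.LSTM(64)(x)

y = layers.Dense(32, activation="relu")(feat_in)

merged = layers.concatenate([x, y])
out = layers.Dense(1, activation="sigmoid", name="is_scientific")(merged)

model = Model(inputs=[text_in, feat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```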
Please use the notebook `minimal_dataset_demo.ipynb` to play with `minimal_dataset.zip`, which can be downloaded from Zenodo (http://doi.org/10.5281/zenodo.3940692).
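For a quick look at the downloaded archive outside the notebook, a minimal sketch along the following lines can be used. The member file names and their format (CSV vs. Parquet) are assumptions here, so inspect the archive listing first and refer to `minimal_dataset_demo.ipynb` for the canonical walkthrough.

```python
import zipfile
import pandas as pd

# Inspect the archive downloaded from Zenodo; the member names and formats
# are assumptions -- adjust to whatever zf.namelist() actually reports.
with zipfile.ZipFile("minimal_dataset.zip") as zf:
    print(zf.namelist())                    # list the files shipped in the archive
    with zf.open(zf.namelist()[0]) as fh:   # open the first member as an example
        sample = pd.read_csv(fh)            # assumes a CSV member; use read_parquet if needed

print(sample.shape)
print(sample.head())
```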
Assuming that Python is already installed (tested with version >= 2.7), the dependencies are listed in `requirements.txt` and can be installed using:
pip install -r requirements.txt
The notebooks can be accessed using:
jupyter notebook
- `README.md`: this file.
- `data/`
- `libraries/`: Contains the libraries `mwparserfromhell` and `wikiciteparser`, which have been modified for the scope of this project. To get all the datasets, the user needs to install these versions of the libraries.
- `lookup/`: Contains two scripts, `run_metadata.py` and `get_apis.py`, which can be used to query CrossRef and Google Books. The `run_metadata.py` script runs asynchronously and can currently only be used for CrossRef. `get_apis.py` uses the `requests` library and can be used to query small batches of metadata (a minimal CrossRef lookup sketch is shown after this list). The other files relate to the CrossRef evaluation, which determines the best heuristic and confidence threshold.
- `notebooks/`: Contains the notebooks: `Sanity_Check_Citations.ipynb`, `Feature_Data_Analysis.ipynb`, `citation_network_model_3_labels.ipynb`, `results_predication_lookup.ipynb`, `wild_examples_lookup_journal.ipynb`.
- `scripts/`: Contains all the scripts used to generate the dataset and features. Each script has a description at the top. All file paths are currently the absolute paths used to run the scripts, so please remember to change them before running.
- `tests/`: Some tests to check that the data-generation scripts do what they are supposed to. More tests will be added in the future to check the whole pipeline.
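The sketch below illustrates the kind of CrossRef lookup that `get_apis.py` performs with the `requests` library. The function name and query parameters are illustrative assumptions rather than the script's actual interface, but the endpoint is the public CrossRef REST API.

```python
import requests

def lookup_crossref(title, rows=3):
    """Query the public CrossRef REST API for works matching a title string."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Return (DOI, first title) pairs for the top matches.
    return [(item.get("DOI"), item.get("title", [""])[0]) for item in items]

print(lookup_crossref("Wikipedia Citations: A comprehensive dataset of citations with identifiers"))
```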
@misc{singh2020wikipedia,
title={Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia},
author={Harshdeep Singh and Robert West and Giovanni Colavizza},
year={2020},
eprint={2007.07022},
archivePrefix={arXiv},
primaryClass={cs.DL}
}