Project author: arshad115

Project description:
Public repo for my master's thesis, "Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)", written for the Master's in Business Informatics at the University of Mannheim, Chair of Data and Web Science.
Language: Python
Repository: git://github.com/arshad115/Uni-Mannheim-Masters-Thesis.git
Created: 2020-06-30T18:36:26Z
Project page: https://github.com/arshad115/Uni-Mannheim-Masters-Thesis

License: GNU General Public License v3.0

Uni Mannheim - Business Informatics Masters Thesis - Arshad Mehmood

Public repo for my master's thesis for the Chair of Data and Web Science:

Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)

First, download the WebIsALOD dataset, extract it, and save it in the data folder.

  1. Fix the dataset URIs:

    To fix the dataset URIs, run the Python script fix_dataset_uris.py.

  2. Extract the concept document files and save the cleaned, preprocessed files:

    To save the cleaned, preprocessed files, run the Python script Read_And_Clean.py.

  3. Download Wikipedia data:

    Use the following command to download the latest dump of English Wikipedia articles:

    curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  4. Preprocess Wikipedia data using Gensim:

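    To preprocess the Wikipedia data, use Gensim's make_wiki script:

    python -m gensim.scripts.make_wiki

    The script expects the path to the downloaded dump and an output prefix as arguments. The file names used in step 5 suggest the prefix wiki.

  5. Train LDA model with Wikipedia data: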

    The models trained on Wikipedia data require the wiki_wordids.txt and wiki_tfidf.mm files generated in the previous step.

    To train the LDA models with Wikipedia data, run the Python script wiki_lda.py.

  6. Train LDA model with WebIsALOD data:

    To train the LDA models with WebIsALOD data, run the Python script webisalod_lda.py.

  7. Train HDP model:

    To train the HDP model with Wikipedia data, run the Python script wiki_hdp.py.

  8. Classification using only topic modeling:

    To run the classification model with only topic modeling, run the Python script polysemous_words.py.

  9. Classification using topic modeling and supervised machine learning algorithms:

    To run the classification model with topic modeling and supervised machine learning algorithms, run the Python script supervised_classifier.py.
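
The order of the steps above matters, because later scripts depend on files written by earlier ones. The following is a minimal sketch of a runner for the Python scripts, not something shipped with the repo. It assumes that every script runs without command-line arguments from the repository root, and that wiki_wordids.txt and wiki_tfidf.mm sit in that same directory; neither assumption is confirmed by the steps above. The names PIPELINE, WIKI_INPUTS, missing_files and run_pipeline are hypothetical. The Wikipedia download and the Gensim preprocessing (steps 3 and 4) are left as manual steps.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the numbered steps; the order mirrors that list.
# Assumption: each script runs without arguments from the repository root.
PIPELINE = [
    ("Fix dataset URIs", "fix_dataset_uris.py"),
    ("Extract and clean concept documents", "Read_And_Clean.py"),
    ("Train LDA on Wikipedia", "wiki_lda.py"),
    ("Train LDA on WebIsALOD", "webisalod_lda.py"),
    ("Train HDP on Wikipedia", "wiki_hdp.py"),
    ("Classify with topic modeling only", "polysemous_words.py"),
    ("Classify with topic modeling and supervised learning",
     "supervised_classifier.py"),
]

# Files produced by the Gensim preprocessing step and needed by step 5.
# Assumption: they are stored in the directory the scripts run from.
WIKI_INPUTS = ["wiki_wordids.txt", "wiki_tfidf.mm"]


def missing_files(names, root="."):
    """Return the names that do not exist under root, in the given order."""
    return [name for name in names if not (Path(root) / name).exists()]


def run_pipeline(steps=PIPELINE, root=".", dry_run=True):
    """Build one command per step and optionally execute them in order.

    With dry_run=True the commands are only returned. Otherwise each
    script is run with the current interpreter, and the first failing
    script raises subprocess.CalledProcessError, which stops the run.
    """
    commands = [[sys.executable, script] for _, script in steps]
    if dry_run:
        return commands
    for (label, _), command in zip(steps, commands):
        print(f"Running: {label}")
        subprocess.run(command, cwd=root, check=True)
    return commands


if __name__ == "__main__":
    missing = missing_files(WIKI_INPUTS)
    if missing:
        print("Missing Wikipedia inputs (see steps 3 and 4):", missing)
    for command in run_pipeline(dry_run=True):
        print(" ".join(command))
```

Run as a script, this only prints the commands it would execute. Calling run_pipeline(dry_run=False) executes them, which is sensible only after the dataset and the Wikipedia files are in place.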