Public repo for my master's thesis "Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)", written for the Master in Business Informatics at the University of Mannheim, Chair of Data and Web Science.
First, download the WebIsALOD dataset, extract it, and save it in the data folder.
Fix the dataset URIs:
To fix the dataset URIs, run the Python script fix_dataset_uris.py.
Extract the concept document files and save the preprocessed clean files:
To save the clean, preprocessed files, run the Python script Read_And_Clean.py.
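The actual preprocessing steps live in Read_And_Clean.py; a minimal sketch of typical cleaning with Gensim (lowercasing, tokenisation, stop-word removal), using a placeholder input string, could look like this:

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def clean_document(text):
    # lowercase, strip punctuation, tokenise, and drop common English stop words
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

raw_text = "Apple is a fruit and also a technology company."
print(clean_document(raw_text))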
Download Wikipedia data:
Use the following command to download the latest English Wikipedia articles dump:
curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Preprocess Wikipedia data using Gensim:
To preprocess the Wikipedia data, use Gensim's script:
python -m gensim.scripts.make_wiki
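Assuming a Gensim version that still ships this script, it takes the path to the downloaded dump and an output prefix as positional arguments, for example:
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki
With the wiki prefix, the script produces the word-id mapping and TF-IDF corpus files referenced in the next step.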
Train LDA model with Wikipedia data:
The wiki_wordids.txt and wiki_tfidf.mm files generated in the previous step are required by the models that use Wikipedia data.
To train the LDA models with Wikipedia data, run the Python script wiki_lda.py.
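wiki_lda.py contains the actual training configuration; a minimal sketch of LDA training on these files with Gensim, assuming the file names above and an arbitrary topic count, looks like this:

from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaModel

# load the word-id mapping and the TF-IDF corpus produced in the previous step
id2word = Dictionary.load_from_text("wiki_wordids.txt")
corpus = MmCorpus("wiki_tfidf.mm")

# train LDA; the number of topics and passes here are placeholders
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=100, passes=1)
lda.save("wiki_lda.model")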
Train LDA model with WebIsALOD data:
To train the LDA models with WebIsALOD data, run the Python script webisalod_lda.py.
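webisalod_lda.py holds the real pipeline; a rough sketch of training LDA on the cleaned WebIsALOD concept documents (token lists from the preprocessing step), with placeholder data and an arbitrary topic count, could look as follows:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tokenised concept documents from the preprocessing step (placeholder data)
documents = [["apple", "fruit", "food"], ["apple", "company", "technology"]]

dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
lda.save("webisalod_lda.model")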
Train HDP model:
To train the HDP model with Wikipedia data, run the Python script wiki_hdp.py.
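wiki_hdp.py defines the actual setup; a minimal HDP sketch on the same Wikipedia files, assuming the file names from the preprocessing step, could look like this:

from gensim.corpora import Dictionary, MmCorpus
from gensim.models import HdpModel

id2word = Dictionary.load_from_text("wiki_wordids.txt")
corpus = MmCorpus("wiki_tfidf.mm")

# unlike LDA, HDP infers the number of topics from the data
hdp = HdpModel(corpus=corpus, id2word=id2word)
hdp.save("wiki_hdp.model")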
Classification using only topic modeling:
To run the classification model with only topic modeling, run the Python script polysemous_words.py.
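The actual decision rule lives in polysemous_words.py; purely as an illustration of a topic-model-only classification, an entity's concept document could be mapped to a topic distribution and flagged as polysemous when more than one topic receives substantial probability. The model path, threshold, and input tokens below are placeholders:

from gensim.models import LdaModel

lda = LdaModel.load("webisalod_lda.model")
dictionary = lda.id2word

def is_polysemous(tokens, threshold=0.2):
    # infer the topic distribution of the entity's concept document and
    # call the entity polysemous if several topics exceed the threshold
    bow = dictionary.doc2bow(tokens)
    topics = lda.get_document_topics(bow, minimum_probability=threshold)
    return len(topics) > 1

print(is_polysemous(["apple", "fruit", "company", "technology"]))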
Classification using topic modeling and supervised machine learning algorithms:
To run the classification model with topic modeling and supervised machine learning, run the Python script supervised_classifier.py.
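supervised_classifier.py contains the actual features and models; as a hedged sketch, the per-entity topic distributions can serve as feature vectors for a standard scikit-learn classifier. The features and labels below are random placeholders, not thesis data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# each row is an entity's topic distribution, each label marks polysemous (1) or not (0)
X = np.random.rand(100, 10)          # placeholder topic-distribution features
y = np.random.randint(0, 2, 100)     # placeholder polysemy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))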