Project author: sharmasapna

Project description:
topic modelling
Language: Jupyter Notebook
Repository: git://github.com/sharmasapna/topic_modelling.git
Created: 2020-08-07T02:10:51Z
Project page: https://github.com/sharmasapna/topic_modelling



topic_modelling

Topic modelling with LDA.

  1. BOW - Bag-of-words approach
  2. TF-IDF approach (a TF-IDF variant is sketched after the base model below)

LDA assumes that each document consists of a mixture of topics, and that those topics generate words according to their probability distributions. Given a dataset of documents, LDA backtracks and tries to figure out which topics would have created those documents in the first place.
Refer to the following for a detailed explanation of how the LDA model works.

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/?#

Steps in Topic Modelling

  1. Loading data
  2. Data Cleaning
  3. Data transformation: Corpus and Dictionary
  4. Base Model Performance
  5. Hyperparameter Tuning
  6. Final Model
  7. Visualize Results
  8. Testing on Unseen Document

Cleaning and Preprocessing

```python
import pickle
import gensim
from nltk.stem import WordNetLemmatizer

# Load a previously pickled list of custom stop words and extend it
with open('customized_stopwords', 'rb') as fp:
    customized_stopwords = pickle.load(fp)
more_stop_words = ['finish','start','tomorrow','work','agree','think','middle','dicide','write','haven','understand','print','call','return','talk','happen']
customized_stopwords = customized_stopwords + more_stop_words

# stemmer = SnowballStemmer("english")

def lemmatize(word):
    # Reduce a word to its root form, treating it as a verb
    return WordNetLemmatizer().lemmatize(word, pos='v')

def preprocess(text):
    # Tokenize, drop short and stop-listed tokens, and lemmatize the rest
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if (token not in gensim.parsing.preprocessing.STOPWORDS) and (len(token) > 4) and (token not in customized_stopwords):
            result.append(lemmatize(token))
    return result
```
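
A quick, purely illustrative check of what `preprocess` returns (the sentence is made up; the exact output depends on the pickled stop-word list):

```python
# Hypothetical sentence for illustration only
sample = "The speakers were discussing the budget and planning the presentation"
print(preprocess(sample))
# Tokens shorter than five characters and stop-listed tokens are dropped; verbs are lemmatized
```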

Importing data from a bunch of documents

```python
import glob

combined_words = ""
docs = []
for transcript_file_name in glob.iglob('./transcripts/train//*.*', recursive=True):
    # print(os.path.basename(transcript_file_name))
    data = open(transcript_file_name).readlines()
    # Each transcript line is assumed to look like "speaker_x: utterance"
    speaker_data = {line.split(":")[0]: line.split(":")[1] for line in data}
    words_in_file = ""
    speaker_dic = {}
    for name, words in speaker_data.items():
        words = words.replace("\n", "").lower()
        words_in_file = words_in_file + words
        if name.split("_")[0] in speaker_dic:
            speaker_dic[name.split("_")[0]] += words
        else:
            speaker_dic[name.split("_")[0]] = words
    # print("Number of words in the file:", str(len(words_in_file)))
    combined_words += words_in_file
    docs.append([words_in_file])
```

Preparing data for LDA model

```python
cleaned_docs = []
for doc in docs:
    for word in doc:
        cd = preprocess(word)
        cleaned_docs.append(cd)
```

Preparing the dictionary and document-term matrix and implementing the LDA model

```python
dictionary = gensim.corpora.Dictionary(cleaned_docs)
dictionary.filter_extremes(no_below=1, no_above=0.5, keep_n=100000)  # optional
bow_corpus = [dictionary.doc2bow(doc) for doc in cleaned_docs]
ldamodels = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=4, id2word=dictionary, passes=30)
```
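
The introduction also lists a TF-IDF approach. As a hedged sketch (not part of the code above), the same BOW corpus can be re-weighted with gensim's `TfidfModel` before training:

```python
from gensim.models import TfidfModel

# Re-weight raw counts by TF-IDF and train a second LDA model on the weighted corpus
tfidf = TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
ldamodels_tfidf = gensim.models.ldamodel.LdaModel(tfidf_corpus, num_topics=4, id2word=dictionary, passes=30)
```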

bow_corpus:
Gensim creates a unique id for each word in the documents. The resulting corpus is a list of (word_id, word_frequency) pairs per document.
For example, (0, 7) implies that word id 0 occurs seven times in the first document; likewise, word id 1 occurs three times, and so on.
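
To make these id/count pairs easier to read, they can be mapped back to the actual words (a small sketch, assuming `bow_corpus` contains at least one non-empty document):

```python
# Print the first document of the corpus with readable words instead of ids
first_doc = bow_corpus[0]
print(first_doc[:10])  # e.g. [(0, 7), (1, 3), ...]
print([(dictionary[word_id], count) for word_id, count in first_doc[:10]])
```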

Printing the output

```python
for i in ldamodels.print_topics(num_words=18):
    for j in i:
        print(j)
```

Using pyLDAvis to visualize

```python
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodels, bow_corpus, dictionary=ldamodels.id2word)
vis
```
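
Outside a notebook, the prepared visualization can also be written to a standalone HTML file (an optional extra step, not in the original notebook; the file name is arbitrary):

```python
# Save the interactive visualization so it can be opened in any browser
pyLDAvis.save_html(vis, 'lda_visualization.html')
```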

Testing on unseen document

```python
unseen_doc_file_path = './transcripts/test/unseen_transcript.txt'
combined_words = ""
docs = []
data = open(unseen_doc_file_path).readlines()
speaker_data = {line.split(":")[0]: line.split(":")[1] for line in data}
words_in_file = ""
speaker_dic = {}
for name, words in speaker_data.items():
    words = words.replace("\n", "").lower()
    words_in_file = words_in_file + words
    if name.split("_")[0] in speaker_dic:
        speaker_dic[name.split("_")[0]] += words
    else:
        speaker_dic[name.split("_")[0]] = words
# print("Number of words in the file:", str(len(words_in_file)))
combined_words += words_in_file
docs.append([words_in_file])

cleaned_docs = []
for doc in docs:
    for word in doc:
        cd = preprocess(word)
        cleaned_docs.append(cd)

# Convert the unseen document to a BOW vector using the training dictionary and score it
bow_vector = dictionary.doc2bow(cleaned_docs[0])
for index, score in sorted(ldamodels[bow_vector], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, ldamodels.print_topic(index, 7)))
```

The result was not as desired. There could be several reasons for this; in our case, I think increasing the amount of training data would improve the model's accuracy.

Major drawbacks of BOW

  1. We need to create huge, mostly empty vectors (a sparse matrix) to represent each document, which wastes memory and space.
  2. It does not retain any context information: it ignores the order in which words appear in a sentence. For instance, it treats the sentences “Bottle is in the car” and “Car is in the bottle” as identical, even though they mean completely different things.

Hyperparameter Tuning

The alpha and beta parameters
Alpha represents document-topic density: with a higher alpha, documents are made up of more topics, and with a lower alpha, documents contain fewer topics.
Alpha is the hyperparameter of the Dirichlet prior, the distribution from which we draw theta; theta in turn is the parameter that determines the shape of each document's topic distribution. So, essentially, alpha influences how topic distributions are drawn.
Beta represents topic-word density: with a high beta, topics are made up of most of the words in the corpus, and with a low beta they consist of only a few words.
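
In gensim, alpha and beta map to the `alpha` and `eta` arguments of `LdaModel`. A minimal sketch of setting them, assuming the `bow_corpus` and `dictionary` built above (the numeric values are arbitrary placeholders, not taken from the original notebook):

```python
# Alpha and beta (called 'eta' in gensim) are passed when building the model
lda_tuned = gensim.models.ldamodel.LdaModel(
    bow_corpus,
    id2word=dictionary,
    num_topics=4,
    passes=30,
    alpha=0.31,   # document-topic density; 'symmetric', 'asymmetric' or 'auto' also work
    eta=0.31,     # topic-word density; 'symmetric' or 'auto' also work
    random_state=42,
)
```

In practice these values are chosen by training models over a grid of candidate alphas and betas and comparing the coherence scores described in the next section.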

Perplexity and Topic Coherence

Coherence Parameters

“C_v” is based on a sliding window, a one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
“C_p” is based on a sliding window, a one-preceding segmentation of the top words and the confirmation measure of Fitelson’s coherence.
“C_uci” is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words.
“C_umass” is based on document co-occurrence counts, a one-preceding segmentation and a logarithmic conditional probability as the confirmation measure.
“C_npmi” is an enhanced version of the C_uci coherence that uses normalized pointwise mutual information (NPMI).
“C_a” is based on a context window, a pairwise comparison of the top words and an indirect confirmation measure that uses NPMI and cosine similarity.
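
A minimal sketch of computing perplexity and two of these coherence measures with gensim, assuming the `ldamodels`, `bow_corpus`, `cleaned_docs` and `dictionary` objects built above:

```python
from gensim.models import CoherenceModel

# Per-word likelihood bound from gensim, commonly reported as (log) perplexity
print("Log perplexity:", ldamodels.log_perplexity(bow_corpus))

# c_v coherence needs the tokenized texts; higher is better
coherence_cv = CoherenceModel(model=ldamodels, texts=cleaned_docs,
                              dictionary=dictionary, coherence='c_v')
print("c_v coherence:", coherence_cv.get_coherence())

# u_mass coherence works directly from the BOW corpus
coherence_umass = CoherenceModel(model=ldamodels, corpus=bow_corpus,
                                 dictionary=dictionary, coherence='u_mass')
print("u_mass coherence:", coherence_umass.get_coherence())
```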