Project author: emanuelemorales

Project description:
Text mining techniques applied to Facebook comments and SMS spam detection.
Language: Jupyter Notebook
Repository: git://github.com/emanuelemorales/TextMining.git
Created: 2021-03-13T10:30:50Z
Project page: https://github.com/emanuelemorales/TextMining


Text Mining Project

Text analysis is becoming a fundamental tool in data science because it parses unstructured text and extracts machine-readable facts from it.

The goal of this text mining project is to accomplish three main tasks:

  • First Task - Data Cleaning and Pre-processing on Facebook comments:
  1. Removing punctuation and stop words;
  2. Tokenization of the text;
  3. Bi-gram extraction;
  4. Splitting the corpus into sentences;
  5. Bag of words;
  6. TF-IDF and document term matrix;
  7. Implementation of the previous steps as pipelines.
  • Second Task - Classification, Clustering and Topic Modeling of SMS messages (Spam Detection):
  1. Classification with Logistic Regression;
  2. K-means Clustering;
  3. Topic modeling using LDA (Latent Dirichlet Allocation).
  • Third Task - Summarization of a text:
  1. Application of the TextRank algorithm to summarize a text from a WW2 textbook.
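
The pre-processing steps of the first task can be illustrated with a minimal, standard-library-only sketch. It is not taken from the project's notebooks, which may well rely on libraries such as NLTK or scikit-learn. The stop-word list, the function names, and the TF-IDF weighting (term frequency times log(N / document frequency)) are all assumptions chosen for illustration.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the notebooks).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "this", "it"}


def split_sentences(text):
    """Split a text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def tf_idf(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        matrix.append(
            [(counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, matrix


# Hypothetical example comments, invented for this sketch.
comments = ["This is a great post!", "Great photo, great day.", "The post is spam."]
vocabulary, matrix = tf_idf(comments)
```

Here `matrix` plays the role of the document-term matrix: terms that occur in only one comment (such as "spam") get a positive weight in that comment's row and zero elsewhere. A library pipeline would typically add smoothing and normalization on top of this.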