Project author: emanuelemorales
Project description:
Text mining techniques applied to Facebook comments and SMS spam detection.
Language: Jupyter Notebook
Project URL: git://github.com/emanuelemorales/TextMining.git
Text Mining Project
Text analysis is becoming a fundamental tool in data science, because parsing text is essential for extracting machine-readable facts from it.
The goal of this text mining project is to accomplish three main tasks:
- First Task - Data cleaning and pre-processing of Facebook comments:
  - Removing punctuation and stop words;
  - Tokenizing the text;
  - Extracting bi-grams;
  - Splitting the corpus into sentences;
  - Building a bag of words;
  - Computing TF-IDF and the document-term matrix;
  - Implementing the previous steps with pipelines.
- Second Task - Classification, clustering and topic modeling of SMS messages (spam detection):
  - Classification with Logistic Regression;
  - K-means clustering;
  - Topic modeling using LDA (Latent Dirichlet Allocation).
- Third Task - Summarization of a text:
  - Applying the TextRank algorithm to summarize a text from a WW2 textbook.
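
The pre-processing steps of the first task can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the project's notebooks: the sample comments, the tiny stop-word list and all function names are assumptions made for this example, and the TF-IDF weighting shown (`count * ln(N / df)`) is only one of several common variants. It uses the standard library only, so it does not show the pipeline-based implementation.

```python
import math
import re
import string
from collections import Counter

# Hypothetical stand-ins for Facebook comments; not taken from the project's data.
COMMENTS = [
    "I love this page! The posts are great.",
    "The posts are boring. I do not love them.",
    "Great page, great posts!",
]

# Deliberately tiny, illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"i", "the", "are", "do", "not", "this", "them", "a", "is"}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def bigrams(tokens):
    """Return pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def bag_of_words(docs_tokens):
    """Build a sorted vocabulary and a document-term matrix of raw counts."""
    vocab = sorted({tok for doc in docs_tokens for tok in doc})
    counts = [Counter(doc) for doc in docs_tokens]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by idf = ln(N / df), one common TF-IDF variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in matrix
    ]


tokens = [tokenize(c) for c in COMMENTS]
vocab, dtm = bag_of_words(tokens)
weights = tf_idf(dtm)
```

With this weighting, a term that occurs in every comment gets weight zero, which is the intended effect of TF-IDF: words shared by all documents carry no discriminating information.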
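
For the third task, the idea behind TextRank-based extractive summarization can be sketched as follows: sentences become graph nodes, edges are weighted by word overlap, and a PageRank-style iteration scores each sentence. This is a simplified sketch under stated assumptions, not the project's implementation: the function names, the damping factor of 0.85, the fixed iteration count and the sample text are all choices made for this example, and the similarity measure counts unique words per sentence.

```python
import math
import re


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Return the set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Shared-word count normalized by the log sizes of both word sets."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style iteration."""
    n = len(sentences)
    bags = [words(s) for s in sentences]
    # Symmetric similarity matrix with a zero diagonal (no self-loops).
    w = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Hypothetical sample text; the project itself summarizes a WW2 textbook.
TEXT = (
    "Text mining extracts facts from text. "
    "Summaries shorten long documents. "
    "Mining text needs clean text."
)
summary = summarize(TEXT, k=2)
```

A sentence that shares no words with any other sentence receives only the baseline score `1 - damping`, so it is the first to be dropped from the summary.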