Project author: partoftheorigin

Project description:
Predicts tags for Stack Overflow posts using a multilabel classification approach.
Primary language: Jupyter Notebook
Project URL: git://github.com/partoftheorigin/multilabel-classification-stack-overflow.git


Multilabel classification on Stack Overflow tags

Predicts tags for Stack Overflow posts using a multilabel classification approach.

Dataset

  • A dataset of post titles from Stack Overflow.

Transforming text to a vector

  • Text data is transformed into numeric vectors using bag-of-words and TF-IDF representations.
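
As a rough illustration of the two representations, the sketch below vectorizes a few titles with scikit-learn. The titles are made up, and the use of `CountVectorizer` and `TfidfVectorizer` is an assumption; the notebook may build its vectors differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical post titles, made up for illustration only.
titles = [
    "how to sort a list in python",
    "python list comprehension syntax",
    "sort an array in javascript",
]

# Bag-of-words: each column counts occurrences of one vocabulary word.
bow = CountVectorizer()
X_bow = bow.fit_transform(titles)

# TF-IDF: the same vocabulary, but words that appear in many titles
# are down-weighted relative to rarer, more distinctive words.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(titles)

# Both are sparse matrices with one row per title and one column per word.
print(X_bow.shape, X_tfidf.shape)
```

Both matrices share the same shape here because the two vectorizers use the same default tokenization; only the cell values differ.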

MultiLabel classifier

MultiLabelBinarizer transforms the labels into binary form, so each prediction is a mask of 0s and 1s with one position per tag.
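
A minimal sketch of this label encoding with scikit-learn's `MultiLabelBinarizer` follows. The tag lists are hypothetical and only serve to show the shape of the output.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists, one list per post.
tags = [["python", "list"], ["javascript"], ["python", "sorting"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# classes_ holds the tag for each column, in sorted order.
print(mlb.classes_)
# Each row is a 0/1 mask marking which tags apply to that post.
print(y)
# inverse_transform turns masks back into tuples of tag names.
print(mlb.inverse_transform(y))
```

The same fitted binarizer can decode a classifier's predicted masks back into tag names through `inverse_transform`.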

Logistic Regression for Multilabel classification

  • Regularization coefficient = 10
  • L2 regularization
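
One way such a classifier could be set up is sketched below. It rests on several assumptions: that the coefficient of 10 corresponds to scikit-learn's `C` parameter (the inverse of regularization strength), that one logistic regression is trained per tag through `OneVsRestClassifier`, and that TF-IDF features are used. The titles and tags are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data, for illustration only.
titles = [
    "sort a list in python",
    "python dictionary iteration",
    "javascript array sort",
    "javascript promise chaining",
    "python list slicing",
    "javascript array filter",
]
tags = [
    ["python", "sorting"],
    ["python"],
    ["javascript", "sorting"],
    ["javascript"],
    ["python"],
    ["javascript"],
]

X = TfidfVectorizer().fit_transform(titles)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# Assumption: "coefficient = 10" is mapped to C=10.
# L2 is LogisticRegression's default penalty, so it is not set explicitly.
base = LogisticRegression(C=10, max_iter=1000)

# One independent binary classifier is fitted per tag column.
clf = OneVsRestClassifier(base)
clf.fit(X, y)

# Predictions come back as a 0/1 mask with the same layout as y.
pred = clf.predict(X)
print(pred.shape)
```

In scikit-learn a larger `C` means weaker regularization, so under this reading a value of 10 penalizes large weights less than the default of 1.0.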

Evaluation

Results are evaluated using several classification metrics.
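
The specific metrics are not listed here. As an assumption, the sketch below computes a few metrics commonly used for multilabel problems with scikit-learn (subset accuracy, micro- and macro-averaged F1, and Hamming loss) on hand-made masks; it does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical true and predicted tag masks (rows are posts, columns are tags).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])

# Subset accuracy: a post counts as correct only if every tag matches.
acc = accuracy_score(y_true, y_pred)

# Micro F1 pools all tag decisions; macro F1 averages the per-tag F1 scores.
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_macro = f1_score(y_true, y_pred, average="macro")

# Hamming loss: the fraction of individual tag decisions that are wrong.
ham = hamming_loss(y_true, y_pred)

print(acc, f1_micro, f1_macro, ham)
```

Subset accuracy is strict for multilabel data, which is why averaged F1 scores are often reported alongside it.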

Libraries

  • NumPy: a package for scientific computing.
  • Pandas: a library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • scikit-learn: a tool for data mining and data analysis.
  • NLTK: a platform for working with natural language.