Project author: partoftheorigin
Project description:
Predict tags for posts from StackOverflow with multilabel classification approach.
Language: Jupyter Notebook
Project URL: git://github.com/partoftheorigin/multilabel-classification-stack-overflow.git
Dataset
- Dataset of post titles from StackOverflow
Transforming text to a vector
- Transformed text data to numeric vectors using bag-of-words and TF-IDF.
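The two vectorizations named above can be sketched with scikit-learn, which appears in the Libraries list. This is a minimal sketch, not the notebook's actual code: the sample titles are made up, and the use of `CountVectorizer` and `TfidfVectorizer` with default settings is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical post titles, made up for illustration (not from the dataset).
titles = [
    "how to parse json in python",
    "sort a list of objects in java",
    "python list comprehension with condition",
]

# Bag-of-words: each column is a vocabulary token, each value a raw count.
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(titles)

# TF-IDF: counts are reweighted so tokens shared by many titles weigh less.
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(titles)

print(X_bow.shape, X_tfidf.shape)
```

Both calls return sparse matrices with one row per title, so either can be passed directly to a scikit-learn classifier.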
MultiLabel classifier
- Used MultiLabelBinarizer to transform the labels into binary form, so each prediction is a mask of 0s and 1s.
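As a small illustration of the binary mask, the sketch below runs scikit-learn's `MultiLabelBinarizer` on three hypothetical tag sets (the tags are invented for the example, not taken from the dataset).

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag sets, one per post (illustrative only).
tags = [["python", "json"], ["java"], ["python", "list"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

print(mlb.classes_)  # column order of the mask: tags sorted alphabetically
print(y)             # one row per post, one 0/1 column per tag

# The mask can be mapped back to tag tuples.
recovered = mlb.inverse_transform(y)
```

Each row of `y` has a 1 in the columns of the tags attached to that post and 0 elsewhere; a classifier trained on `y` predicts masks of the same shape.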
Logistic Regression for Multilabel classification
- Coefficient = 10
- L2-regularization technique
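One plausible way to set this up in scikit-learn is a one-vs-rest wrapper around logistic regression, which trains one binary classifier per tag. The sketch below rests on assumptions: the one-vs-rest strategy is not stated in this description, the "coefficient" is read as scikit-learn's `C` parameter (inverse regularization strength), and the titles and tags are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data, made up for illustration.
titles = [
    "parse json in python",
    "python list comprehension",
    "java arraylist sort",
    "java json parser library",
    "python read json file",
    "sort list in java",
]
tags = [
    ["python", "json"],
    ["python"],
    ["java"],
    ["java", "json"],
    ["python", "json"],
    ["java"],
]

X = TfidfVectorizer().fit_transform(titles)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# One binary logistic regression per tag.
# Assumption: "Coefficient = 10" corresponds to C=10.
# L2 is scikit-learn's default penalty, so it is not passed explicitly.
clf = OneVsRestClassifier(LogisticRegression(C=10, max_iter=1000))
clf.fit(X, y)

pred = clf.predict(X)  # binary mask, same shape as y
print(mlb.inverse_transform(pred))
```

In scikit-learn a larger `C` means weaker regularization, so `C=10` penalizes large weights less than the default `C=1.0`.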
Evaluation
Results were evaluated using several classification metrics.
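The specific metrics are not listed here. As a hedged sketch, the example below computes three metrics commonly used for multilabel output (subset accuracy, micro-averaged F1 and macro-averaged F1) on hypothetical masks; the choice of these metrics is an assumption, not taken from the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted masks (rows = posts, columns = tags).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])

# Subset accuracy: a post counts as correct only if every tag matches.
subset_acc = accuracy_score(y_true, y_pred)

# Micro F1 pools all (post, tag) decisions; macro F1 averages per-tag F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(subset_acc, micro_f1, macro_f1)
```

Subset accuracy is strict for multilabel problems, which is why averaged F1 scores are often reported alongside it.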
Libraries
- NumPy: a package for scientific computing.
- pandas: a library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- scikit-learn: a tool for data mining and data analysis.
- NLTK: a platform for working with natural language.