TurkishFolkSongClassification

Turkish Folk Song Text Classifier

Preprocessing

The dataset consists of Turkish folk song texts from three regions: Sivas, Erzurum, and Erzincan. The data is text, and the number of samples per region is imbalanced.
As Figure 1.0 shows, the Sivas, Erzurum, and Erzincan regions have 639, 514, and 461 samples, respectively.

Figure 1.0

To address this imbalance, we used resampling, which aims to give every label the same number of samples. There are two resampling approaches: under-sampling, which removes samples from the majority class, and over-sampling, which adds more examples to the minority classes. We chose over-sampling. We rejected under-sampling because the dataset is small: removing samples could discard important data and lead to under-fitting.
Figure 1.1 shows the result: each region now has 639 samples.

Figure 1.1
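
The over-sampling step described above can be sketched as follows. This is a minimal illustration that assumes plain random over-sampling (duplicating randomly chosen minority samples); the project may use a different resampling routine. The function name random_oversample, the seed, and the placeholder documents are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw the missing samples (with replacement) from the class itself.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Placeholder documents with the class sizes reported above.
sizes = {"Sivas": 639, "Erzurum": 514, "Erzincan": 461}
texts = [f"{region} song {i}" for region, n in sizes.items() for i in range(n)]
labels = [region for region, n in sizes.items() for _ in range(n)]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))  # every region ends up with 639 samples
```

One design point to keep in mind: when duplicates are created before the train/test split, copies of the same text can land on both sides of the split, which tends to make test scores look better than they are. Resampling only the training portion avoids this.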

Cleaning

After resampling, we cleaned the data to remove unnecessary information. In this step, the text is converted to lowercase, and new lines, punctuation, digits, and special characters are removed. Turkish stop words are removed using the stop-word list from the NLTK library. See the ‘preprocessing.py’ file for details.
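
A rough sketch of these cleaning steps is shown below. It is not the code from ‘preprocessing.py’: the function names, the regular expression, and the tiny stop-word set are assumptions made for illustration (the project takes its list from NLTK, for example via nltk.corpus.stopwords.words("turkish")). The sketch also handles Turkish casing explicitly, because Python's str.lower() maps "I" to "i" rather than to the dotless "ı".

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the project uses NLTK's Turkish stop-word list instead.
TURKISH_STOP_WORDS = {"ve", "bu", "da", "de", "ile", "için"}


def turkish_lower(text):
    """Lowercase using Turkish casing rules: I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def clean_text(text, stop_words=TURKISH_STOP_WORDS):
    text = turkish_lower(text)
    # Replace punctuation, special characters, digits, and underscores
    # with spaces. Turkish letters such as ç, ğ, ı, ö, ş, ü are kept.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # split() also collapses new lines and repeated spaces.
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


print(clean_text("Sivas Yolları,\nIrmak ve 2 DAĞ!"))  # sivas yolları ırmak dağ
```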

Feature Extraction

For feature extraction, we used a Bag of Words (BOW) model, a common method in natural language processing. BOW traverses all the text in the data, builds a vocabulary of the distinct words, and stores them as features; each document is then represented by how often each vocabulary word occurs in it, and these counts are used for training. In Python, scikit-learn’s CountVectorizer is a BOW implementation that converts a collection of text documents into a matrix of term counts. In the project, we used CountVectorizer with its default settings and converted the result to a NumPy array.
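
As a small illustration of what CountVectorizer produces with default settings, the sketch below vectorizes two made-up documents. The example sentences are placeholders, not samples from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for cleaned song texts.
docs = [
    "dağlar dağlar yol ver",
    "yol uzun dağlar yüksek",
]

vectorizer = CountVectorizer()       # default settings
X = vectorizer.fit_transform(docs)   # SciPy sparse matrix of term counts
counts = X.toarray()                 # dense NumPy array, one row per document

# Vocabulary words in column order.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)  # ['dağlar', 'uzun', 'ver', 'yol', 'yüksek']
print(counts)      # rows: [2 0 1 1 0] and [1 1 0 1 1]
```

Each column corresponds to one vocabulary word and each row to one document, so the first row records that "dağlar" occurs twice in the first document.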

Cross Validation

We split the data into 80% training and 20% test sets. The split is applied to each class at the same rate, so both sets keep the same class proportions. We used scikit-learn’s train_test_split function for this task.
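
The split can be sketched as below. The stratify argument is one way to split each class at the same rate; whether the project passes it this way is an assumption, as are the random_state value and the placeholder data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder labels matching the balanced dataset (639 samples per region).
labels = ["Sivas"] * 639 + ["Erzurum"] * 639 + ["Erzincan"] * 639
samples = list(range(len(labels)))  # stand-ins for the feature rows

X_train, X_test, y_train, y_test = train_test_split(
    samples,
    labels,
    test_size=0.2,      # 80% train / 20% test
    stratify=labels,    # keep the same class ratio in both parts
    random_state=42,    # assumed seed, only for reproducibility
)

print(len(X_train), len(X_test))  # 1533 384
print(Counter(y_test))            # 128 samples per region
```

With 1917 balanced samples, a 20% test share rounds up to 384 test samples, which leaves 1533 for training.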

Classifications

We trained and tested five machine learning algorithms on the dataset: Random Forest, Naive Bayes, Decision Tree, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM).
We used several algorithms to compare their results and measure how each model performs.
Naive Bayes has the highest accuracy of the five, at 80.21% (Table 1.0).

Table 1.0

Classifier	Accuracy (%)	Precision (%)	Recall (%)	F1 (%)
Naive Bayes	80.21	80.45	80.20	80.45
Random Forest	77.86	77.43	77.86	77.43
SVM	76.56	76.41	76.56	76.41
Decision Tree	76.30	76.10	76.30	76.10
KNN	51.56	47.80	51.56	57.80
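
A comparison like the one in Table 1.0 can be sketched as follows. Several details here are assumptions rather than the project's confirmed setup: the specific estimator classes (for example MultinomialNB for Naive Bayes and SVC for SVM), their default hyperparameters, the seeds, and the use of weighted averaging for precision, recall, and F1. Weighted averaging is a plausible guess because weighted recall always equals accuracy, as the Recall column largely does. The tiny corpus is made up, so the scores it yields say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus standing in for the cleaned song texts.
texts = (["sivas yol kar ayaz"] * 10
         + ["erzurum dağ bar oyun"] * 10
         + ["erzincan bağ üzüm su"] * 10)
labels = ["Sivas"] * 10 + ["Erzurum"] * 10 + ["Erzincan"] * 10

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0)

# Assumed estimator choices with default hyperparameters.
models = {
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0)
    results[name] = (accuracy, precision, recall, f1)
    print(f"{name:14s} acc={accuracy:.4f} p={precision:.4f} "
          f"r={recall:.4f} f1={f1:.4f}")
```

Fitting every model on the same training split and scoring it on the same test split keeps the rows of the comparison directly comparable.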