我正在研究情绪分析问题,数据看起来像这样:
label instances 5 1190 4 838 3 239 1 204 2 127
所以,我的数据是不平衡的,因为1190 instances标有5。对于使用scikit的SVC进行的分类Im 。问题是我不知道如何以正确的方式平衡我的数据,以便准确计算多类案例的精度,查全率,准确性和f1得分。因此,我尝试了以下方法:
第一:
wclf = SVC(kernel='linear', C= 1, class_weight={1: 10}) wclf.fit(X, y) weighted_prediction = wclf.predict(X_test) print 'Accuracy:', accuracy_score(y_test, weighted_prediction) print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted') print 'Recall:', recall_score(y_test, weighted_prediction, average='weighted') print 'Precision:', precision_score(y_test, weighted_prediction, average='weighted') print '\n clasification report:\n', classification_report(y_test, weighted_prediction) print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)
第二:
auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto') auto_wclf.fit(X, y) auto_weighted_prediction = auto_wclf.predict(X_test) print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction) print 'F1 score:', f1_score(y_test, auto_weighted_prediction, average='weighted') print 'Recall:', recall_score(y_test, auto_weighted_prediction, average='weighted') print 'Precision:', precision_score(y_test, auto_weighted_prediction, average='weighted') print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction) print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)
第三:
clf = SVC(kernel='linear', C= 1) clf.fit(X, y) prediction = clf.predict(X_test) from sklearn.metrics import precision_score, \ recall_score, confusion_matrix, classification_report, \ accuracy_score, f1_score print 'Accuracy:', accuracy_score(y_test, prediction) print 'F1 score:', f1_score(y_test, prediction) print 'Recall:', recall_score(y_test, prediction) print 'Precision:', precision_score(y_test, prediction) print '\n clasification report:\n', classification_report(y_test,prediction) print '\n confussion matrix:\n',confusion_matrix(y_test, prediction) F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1". sample_weight=sample_weight) /usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1". sample_weight=sample_weight) /usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1". sample_weight=sample_weight) 0.930416613529
但是,我收到这样的警告:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1"
如何正确处理我的不平衡数据,以便以正确的方式计算分类器的指标?