项目作者: sihanh

项目描述 :
Adversarial attack on spam filter
高级语言: Python
项目地址: git://github.com/sihanh/Spamfilter1.0.git
创建时间: 2019-10-16T05:34:47Z
项目社区:https://github.com/sihanh/Spamfilter1.0

开源协议:

下载


Spamfilter1.0

Naive spam filter using sklearn

Spam Classifier

Naive Bayes Spam Filter

  • Dependencies: sklearn for classifier frame,numpy,pandas for data processing.

  • The classifier training procedure is in spam_sky.py,running this script to train the classifier and view the result.

  • After finishing running, use Function nb_predict(message),return_type = str in python console to evaluate a given message in type: str to classify the message. For example,

    1. In[1]: nb_predict('I will come by your house tonight after dinner, we can hangout somewhere')
    2. The message tends to be ham
    3. Out[1]: 'ham'
  • Example 2

  1. In[2]: nb_predict('Your chance to win 500$ prize! Claim today')
  2. The message tends to be spam
  3. Out[2]: 'spam'

Maximum Entropy spam fiilter(Maxent filter)

  • Depedencies: nltk, pandas, numpy, collections

  • The classifier training procedure is in spam_maxent_api.py,running this script to train the classifier and view the result. The parameter max_iter will somehow determine the performance of Maxent Classifier and the larger of the max_iter, it will take longer time to train. You can view the trace of training during the running of the script.

  • Use function ma_predict(message),return_type = str in python console to evaluate a given message in type: str to classify the message.

  • The results are kind of same as the those from previous examples
    ```python
    In[4]: ma_predict(‘I will come by your house tonight after dinner, we can hangout somewhere’)

    1. The input message is tend to be ham

    Out[4]: ‘ham’


In[5]: ma_predict(‘Your chance to win 500$ prize! Claim today’)
The input message is tend to be spam
Out[5]: ‘spam’

  1. ## Good Word Attack
  2. * Run `good_word_attack.py` to implement active good word attack. Two word lists are generated.
  3. * First-n-word list was obtained using `CountVectorizer.vocabulary_.items()` from `sklearn.feature_extraction.text`, which gives the n most presented words.
  4. * Best-n-word list was obtained using `show_most_informative_features(20, show='ham')` from `nltk.classify.MaxentClassifier`, which gives the n words has most impact.
  5. * The script `good_word_attack.py` will automatically run `spam_sky.py` and `spam_maxent_api.py`. Functions from previous sections are also available to use.
  6. * Function `word_attack_NB(word_list, message)`, `(word_list = best_n_word or first_n_word)` return the minimum numbers of good words needed from ***word_list*** to turn a spam mail into ham mail using Naive Bayes Classifier. The classifier is more vulnerable if fewer words are needed to break through. This method was proposed by **Christopher Meek** in 2005.
  7. * Function `word_attack_MA(word_list, message)` runs similarly for Maxent filter.
  8. * Example
  9. ## NB Reiew Classifier
  10. * apply NB classifiers on yelp review
  11. * similar function usage as before
  12. * `NBReviewClassifier.py`
  13. ## Maxent Review Classifier
  14. * apply MA classifiers on yelp review
  15. * similar function usage as before
  16. * Run `Review_maxent.py`
  17. * This Maxent Filter takes about 1 hour to train. Time variance apply on different machines.
  18. ## Good word review test
  19. * Evaluate different parameters
  20. * Run `good_word_review_test.py`
  21. * n1, n2, n3 .... n8 to check parameters under different words adding strategy.
  22. ```python
  23. In[3]: word_attack_NB(first_n_word,'Please call our customer service representative on 0800 169 6031 between 10am-9pm as you have WON a guaranteed a £1000 cash or a £5000 prize')
  24. Out[3]: 2383