Project author: manoelhortaribeiro

Project description:
Characterization and detection of hateful users on Twitter.
Primary language: Jupyter Notebook
Repository: git://github.com/manoelhortaribeiro/HatefulUsersTwitter.git
Created: 2017-10-20T13:59:47Z
Project community: https://github.com/manoelhortaribeiro/HatefulUsersTwitter

License: MIT License

Hateful Users on Twitter

This folder contains the data and the analysis done in the paper:

  @inproceedings{ribeiro2018characterizing,
    title={Characterizing and Detecting Hateful Users on Twitter},
    author={Horta Ribeiro, Manoel and Calais, Pedro and Santos, Yuri and Almeida, Virg{\'\i}lio and Meira Jr, Wagner},
    booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
    year={2018}
  }

The experiments with the GraphSage algorithm are in another repository.

The dataset can be downloaded on Kaggle.

Data and Reproducibility

This dataset contains a network of 100k users, of which ~5k were annotated as hateful or not. For each user, several content-related, network-related, and activity-related features are provided. Some of the files used are not shared because sharing them would violate Twitter's guidelines.

You can download the following files here:

  • bad_words.txt list of bad words matched in the tweets.
  • lexicon.txt the lexicon used in the diffusion method.
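
The bad-word matching above can be sketched as a simple token lookup. This is a minimal illustration, not the repo's actual matching code; the placeholder words stand in for entries from `bad_words.txt`:

```python
import re

# Hypothetical mini lexicon; the real entries come from bad_words.txt.
bad_words = {"slur1", "slur2"}

def count_bad_words(tweet: str) -> int:
    """Count case-insensitive, whole-token lexicon matches in a tweet."""
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())
    return sum(1 for t in tokens if t in bad_words)

print(count_bad_words("some slur1 text SLUR2!"))  # 2
```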

And the following files on Kaggle:

  • users_(hate|suspended)_(glove|all).content files with the feature vector and class for each user. The hate variants label users as hateful, normal, or other, whereas the suspended variants label users as suspended or active. The glove variants contain only the GloVe vectors as features; the all variants also include attributes related to user activity and network centrality. These files are used only by the GraphSage algorithm.

  • user.edges file with all the (directed) edges in the retweet graph.

  • users_clean.graphml networkx-compatible file with the retweet network. User IDs correspond to those in users_anon_neighborhood.csv!

  • users_anon_neighborhood.csv file with several features for each user, as well as the average of some features over their 1-neighborhood (the people they retweeted). Notice that attributes prefixed with c_ are calculated over the 1-neighborhood of a user in the retweet network (averaged out).
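
The c_ neighborhood averaging described above can be sketched on a toy retweet graph. This is an illustration, not the repo's feature code; the edge orientation (u → v meaning u retweeted v) is an assumption to be checked against users.edges:

```python
import networkx as nx

# Toy directed retweet graph; an edge u -> v is assumed to mean
# "u retweeted v" (verify against the actual users.edges file).
G = nx.DiGraph()
G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c")])

# Hypothetical per-user feature (stands in for statuses_count).
statuses = {"a": 100, "b": 50, "c": 200}

def c_statuses_count(user):
    """Average statuses_count over the user's 1-neighborhood."""
    neigh = list(G.successors(user))
    if not neigh:
        return float("nan")
    return sum(statuses[v] for v in neigh) / len(neigh)

print(c_statuses_count("a"))  # (50 + 200) / 2 = 125.0
```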

Attributes description

  • hate :("hateful"|"normal"|"other") whether the user was annotated as hateful, normal, or not annotated.
  • (is_50|is_50_2) :bool whether the user was deleted up to 12/12/17 or 14/01/18.
  • (is_63|is_63_2) :bool whether the user was suspended up to 12/12/17 or 14/01/18.
  • (hate|normal)_neigh :bool whether the user is in the neighborhood of a (hateful|normal) user.
  • [c_] (statuses|follower|followees|favorites)_count :int number of (tweets|followers|followees|favorites) a user has.
  • [c_] listed_count :int number of lists a user is in.
  • [c_] (betweenness|eigenvector|in_degree|outdegree) :float centrality measurements for each user in the retweet graph.
  • [c_] *_empath :float occurrences of Empath categories in the user's latest 200 tweets.
  • [c_] *_glove :float GloVe vector calculated from the user's latest 200 tweets.
  • [c_] (sentiment|subjectivity) :float average sentiment and subjectivity of the user's tweets.
  • [c_] (time_diff|time_diff_median) :float average and median time difference between tweets.
  • [c_] (tweet|retweet|quote) number :float percentage of direct tweets, retweets, and quotes for a user.
  • [c_] (number urls|number hashtags|baddies|mentions) :float average number of URLs, hashtags, bad words, and mentions per tweet.
  • [c_] status length :float average status length.
  • hashtags :string all hashtags used by the user, separated by spaces.
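
A typical first step with the attributes above is to load the CSV and keep only the annotated users. The snippet below is a hedged sketch using a synthetic stand-in for users_anon_neighborhood.csv; only a few of the real columns are mimicked, and the sample rows are invented:

```python
import io
import pandas as pd

# Synthetic stand-in for users_anon_neighborhood.csv (invented rows;
# the real file has ~100k users and many more columns).
csv = io.StringIO(
    "user_id,hate,statuses_count,c_statuses_count\n"
    "1,hateful,1200,300.5\n"
    "2,normal,80,120.0\n"
    "3,other,40,60.0\n"
)
df = pd.read_csv(csv)

# Keep only the users annotated as hateful or normal;
# "other" marks users that were not annotated.
annotated = df[df["hate"] != "other"]
print(annotated["hate"].value_counts().to_dict())
```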

Folder Structure

These are the main folders, reproducible with the dataset downloaded from Kaggle:

  • ./analysis/ contains the script exploring the dataset collected.

  • ./classification/ contains the scripts for the boosting classifier.
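
The boosting step can be sketched with scikit-learn. This is not the repo's actual pipeline: the model choice (GradientBoostingClassifier) and the synthetic features below are assumptions standing in for the real user feature vectors from the Kaggle files:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the user feature vectors; the real pipeline
# would load them from the Kaggle files instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = "hateful", 0 = "normal"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 2))
```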

These folders are not reproducible, but are included for completeness:

  • ./crawler/ contains the code used to extract the dataset. You need to set up Neo4j to run it.

  • ./prepreprocessing/ contains scripts to select the users to be annotated, and extract their tweets.

  • ./features/ contains scripts to extract the features to be analyzed and fed into the classifier.

Auxiliary folders:

  • ./data/ data generated by data wrangling.

  • ./secrets/ for the API/DB authentication stuff.

  • ./tmp/ auxiliary scripts.

  • ./img/ images generated by analyses.