Project author: identitymonk

Project description:
Export Tweets from Twitter into a JSON file, then publish them as graph objects in a Neo4j DB

Language: Python
Repository: git://github.com/identitymonk/ConferenceTweetMapper.git
Created: 2018-10-17T04:39:51Z
Community: https://github.com/identitymonk/ConferenceTweetMapper

License: (not specified)


ConferenceTweetMapper

General usage

This repository contains three Python scripts:

  • archive.py

    This script searches Twitter for a particular conference hashtag and returns all matching tweets in a JSON file

  • graphify.py

    This script takes a JSON file containing tweets and transforms them into graph-oriented objects representing another view of the timeline of Tweets, Retweets, Quotes, Hashtags, and Handles. In the case of a Retweet/Quote/Reply, the script will also drill for the original tweet, even if it falls outside the scope of the file.

  • stream.py

    This script searches Twitter for a particular conference hashtag and transforms the matching tweets into graph-oriented objects representing another view of the timeline of Tweets, Retweets, Quotes, Hashtags, and Handles. In the case of a Retweet/Quote/Reply, the script will also drill for the original tweet, even if it falls outside the scope of the search filter.

Pre-requisites

To use these scripts you must have:

  • Python

    • 2.7: for any script not in the Python 3.x list below
    • 3.7: for graphify.py and archive.py
  • Pip installed

  • The following Python packages installed:

    • tweepy
    • json
    • time
    • configparser
    • argparse
    • py2neo
    • asyncio (requires Python 3.x to work)
  • A Neo4j DB installed, configured, and ready for connections

    I won't detail that part here; there are plenty of good tutorials on the Web.
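As a quick sanity check before running the scripts, a small snippet (not part of the repository, just an illustrative sketch) can verify that the interpreter and the required packages are available:

```python
import importlib.util
import sys

# Packages the scripts import: tweepy and py2neo come from pip,
# the rest ship with the standard library (configparser/asyncio need Python 3).
REQUIRED = ["tweepy", "json", "time", "configparser", "argparse", "py2neo", "asyncio"]

def check_prerequisites():
    """Return the list of prerequisites that are not satisfied."""
    missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]
    if sys.version_info < (3,):
        missing.append("Python 3.x (required for asyncio)")
    return missing

if __name__ == "__main__":
    missing = check_prerequisites()
    print("Missing:", missing if missing else "nothing, you are good to go")
```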

archive.py

This script can be used as follows:

    usage: archive.py [-h] {file,line} ...

    Export tweets that match the search query

    positional arguments:
      {file,line}   Add configuration from Ini file or through arguments
        file        Adding configuration from a file (Default: Ini/Default.ini)
        line        Adding configuration from arguments on the command line

    optional arguments:
      -h, --help    show this help message and exit
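The `{file,line}` layout above is what argparse produces with subparsers. A minimal sketch of how such a parser can be wired (names are illustrative, not the actual archive.py source):

```python
import argparse

def build_parser():
    # Top-level parser: mirrors "archive.py [-h] {file,line} ..."
    parser = argparse.ArgumentParser(
        prog="archive.py",
        description="Export tweets that match the search query")
    subparsers = parser.add_subparsers(
        dest="cmd",
        help="Add configuration from Ini file or through arguments")

    # "file" subcommand: configuration comes from an Ini file
    file_cmd = subparsers.add_parser("file", help="Adding configuration from a file")
    file_cmd.add_argument("-i", "--ini_file", default="Ini/Default.ini",
                          help="Path to the Ini file (Default: Ini/Default.ini)")

    # "line" subcommand: configuration comes from command-line arguments
    line_cmd = subparsers.add_parser("line", help="Adding configuration from arguments")
    line_cmd.add_argument("-s", "--search", required=True, help="Twitter search filter")
    return parser

args = build_parser().parse_args(["file", "-i", "Ini/MyConf.ini"])
print(args.cmd, args.ini_file)  # file Ini/MyConf.ini
```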

The file subcommand supports the following syntax:

    usage: archive.py file [-h] [-i INI_FILE]

    positional arguments:
      cmd

    optional arguments:
      -h, --help            show this help message and exit
      -i INI_FILE, --ini_file INI_FILE
                            Path to the Ini file (Default: Ini/Default.ini)

The line subcommand supports the following syntax:

    usage: archive.py line [-h] -s SEARCH -ck CONSUMER_KEY -cs CONSUMER_SECRET
                           -ak ACCESS_KEY -as ACCESS_SECRET -o OUTPUT_FILENAME
                           [-b BACKUP_INI_FILE_NAME]

    positional arguments:
      cmd

    optional arguments:
      -h, --help            show this help message and exit
      -s SEARCH, --search SEARCH
                            Twitter search filter
      -ck CONSUMER_KEY, --consumer_key CONSUMER_KEY
                            Twitter consumer key obtained from your Twitter account
      -cs CONSUMER_SECRET, --consumer_secret CONSUMER_SECRET
                            Twitter consumer secret obtained from your Twitter account
      -ak ACCESS_KEY, --access_key ACCESS_KEY
                            Twitter access key obtained from your Twitter account
      -as ACCESS_SECRET, --access_secret ACCESS_SECRET
                            Twitter access secret obtained from your Twitter account
      -o OUTPUT_FILENAME, --output_filename OUTPUT_FILENAME
                            Name of the results output file
      -b BACKUP_INI_FILE_NAME, --backup_ini_file_name BACKUP_INI_FILE_NAME
                            Name of the Ini file to back up this request's parameters
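The Twitter calls themselves go through tweepy; the export step then boils down to dumping the collected statuses into the output file. A hedged sketch of that last step (the field names shown are a minimal subset chosen for illustration, real statuses carry many more):

```python
import json

def export_tweets(tweets, output_filename):
    """Write a list of tweet dicts (as returned by the Twitter API) to a JSON file."""
    with open(output_filename, "w", encoding="utf-8") as handle:
        json.dump(tweets, handle, ensure_ascii=False, indent=2)

# Example with two minimal fake statuses:
sample = [
    {"id_str": "1", "text": "Hello #Identiverse", "user": {"screen_name": "alice"}},
    {"id_str": "2", "text": "RT @alice: Hello #Identiverse", "user": {"screen_name": "bob"}},
]
export_tweets(sample, "search.json")
```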

graphify.py

This script can be used as follows:

    usage: graphify.py [-h] {file,line} ...

    Import tweets in a Graph DB

    positional arguments:
      {file,line}   Add configuration from Ini file or through arguments
        file        Adding configuration from a file (Default: Ini/Default.ini)
        line        Adding configuration from arguments on the command line

    optional arguments:
      -h, --help    show this help message and exit

The file subcommand supports the following syntax:

    usage: graphify.py file [-h] [-i INI_FILE]

    positional arguments:
      cmd

    optional arguments:
      -h, --help            show this help message and exit
      -i INI_FILE, --ini_file INI_FILE
                            Path to the Ini file (Default: Ini/Default.ini)

The line subcommand supports the following syntax:

    usage: graphify.py line [-h] [-type DB_TYPE] [-proto PROTOCOL]
                            [-lang LANGUAGE] [-server SERVER_NAME]
                            [-port SERVER_PORT] -pwd DB_PASSWORD [-set RESULT_SET]
                            -name CONFERENCE_NAME -loc CONFERENCE_LOCATION
                            -time CONFERENCE_TIME_ZONE -start CONFERENCE_START_DATE
                            -end CONFERENCE_END_DATE [-purge PURGE_BEFORE_IMPORT]
                            [-fname FILTER_ORGANIZER_TWITTER_SCREENAME]
                            [-fhash FILTER_CONFERENCE_HASHTAG]
                            [-b BACKUP_INI_FILE_NAME]

    positional arguments:
      cmd

    optional arguments:
      -h, --help            show this help message and exit
      -type DB_TYPE, --db_type DB_TYPE
                            For future use: indicate db type
      -proto PROTOCOL, --protocol PROTOCOL
                            For future use: indicate protocol to connect to the db
      -lang LANGUAGE, --language LANGUAGE
                            For future use: indicate language to query the db
      -sec SECURE, --secure SECURE
                            Flag for secure connection
      -server SERVER_NAME, --server_name SERVER_NAME
                            FQDN of the db server
      -port SERVER_PORT, --server_port SERVER_PORT
                            Server port hosting the db service
      -pwd DB_PASSWORD, --db_password DB_PASSWORD
                            Service password to access the db
      -set RESULT_SET, --result_set RESULT_SET
                            Result set file from the streaming script (Default: Output/search.json)
      -name CONFERENCE_NAME, --conference_name CONFERENCE_NAME
                            Name of the conference for the master node
      -loc CONFERENCE_LOCATION, --conference_location CONFERENCE_LOCATION
                            Location of the conference for the master node
      -time CONFERENCE_TIME_ZONE, --conference_time_zone CONFERENCE_TIME_ZONE
                            Number of (+/-) hours from UTC of the conference's time zone
      -start CONFERENCE_START_DATE, --conference_start_date CONFERENCE_START_DATE
                            First day of the conference in dd/mm/yyyy format
      -end CONFERENCE_END_DATE, --conference_end_date CONFERENCE_END_DATE
                            Last day of the conference in dd/mm/yyyy format
      -purge PURGE_BEFORE_IMPORT, --purge_before_import PURGE_BEFORE_IMPORT
                            Indicate whether the graph must be deleted before importing (Default: false)
      -fname FILTER_ORGANIZER_TWITTER_SCREENAME, --filter_organizer_twitter_screename FILTER_ORGANIZER_TWITTER_SCREENAME
                            Twitter screen name used to filter out organizer tweets and retweets
      -fhash FILTER_CONFERENCE_HASHTAG, --filter_conference_hashtag FILTER_CONFERENCE_HASHTAG
                            Hashtag of the conference
      -b BACKUP_INI_FILE_NAME, --backup_ini_file_name BACKUP_INI_FILE_NAME
                            Name of the Ini file to back up this request's parameters
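graphify.py turns each tweet into nodes and relationships before handing them to py2neo. Independent of the database driver, the mapping itself can be sketched in plain Python (the node labels and relationship names here are assumptions for illustration, not necessarily the ones graphify.py creates):

```python
def tweet_to_graph(tweet):
    """Map one tweet dict to (nodes, relationships) for a property graph."""
    nodes, rels = [], []
    tweet_node = ("Tweet", tweet["id_str"])
    nodes.append(tweet_node)

    # The author handle, linked to the tweet it posted
    handle = ("Handle", tweet["user"]["screen_name"])
    nodes.append(handle)
    rels.append((handle, "POSTED", tweet_node))

    # One Hashtag node per hashtag in the tweet entities
    for tag in tweet.get("entities", {}).get("hashtags", []):
        hashtag = ("Hashtag", tag["text"].lower())
        nodes.append(hashtag)
        rels.append((tweet_node, "TAGGED", hashtag))

    # A retweet points back at the original status, which graphify.py
    # drills for even when it is outside the imported file
    if "retweeted_status" in tweet:
        original = ("Tweet", tweet["retweeted_status"]["id_str"])
        nodes.append(original)
        rels.append((tweet_node, "RETWEETS", original))
    return nodes, rels

example = {
    "id_str": "2", "user": {"screen_name": "bob"},
    "entities": {"hashtags": [{"text": "Identiverse"}]},
    "retweeted_status": {"id_str": "1"},
}
nodes, rels = tweet_to_graph(example)
```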

stream.py

This script can be used as follows:

    usage: stream.py [-h] {file,line} ...

    Export tweets that match the search query

    positional arguments:
      {file,line}   Add configuration from Ini file or through arguments
        file        Adding configuration from a file (Default: Ini/Default.ini)
        line        Adding configuration from arguments on the command line

    optional arguments:
      -h, --help    show this help message and exit

The file subcommand supports the following syntax:

    usage: stream.py file [-h] [-i INI_FILE]

    positional arguments:
      cmd

    optional arguments:
      -h, --help            show this help message and exit
      -i INI_FILE, --ini_file INI_FILE
                            Path to the Ini file (Default: Ini/Default.ini)

The line subcommand supports the following syntax:

    usage: stream.py line [-h] -s SEARCH -ck CONSUMER_KEY -cs CONSUMER_SECRET
                          -ak ACCESS_KEY -as ACCESS_SECRET -o OUTPUT_FILENAME
                          [-type DB_TYPE] [-proto PROTOCOL] [-lang LANGUAGE]
                          [-server SERVER_NAME] [-port SERVER_PORT]
                          -pwd DB_PASSWORD [-set RESULT_SET]
                          -name CONFERENCE_NAME -loc CONFERENCE_LOCATION
                          -time CONFERENCE_TIME_ZONE -start CONFERENCE_START_DATE
                          -end CONFERENCE_END_DATE [-purge PURGE_BEFORE_IMPORT]
                          [-fname FILTER_ORGANIZER_TWITTER_SCREENAME]
                          [-fhash FILTER_CONFERENCE_HASHTAG]
                          [-b BACKUP_INI_FILE_NAME]

    positional arguments:
      cmd

    optional arguments:
      -h, --help            show this help message and exit
      -s SEARCH, --search SEARCH
                            Twitter search filter
      -ck CONSUMER_KEY, --consumer_key CONSUMER_KEY
                            Twitter consumer key obtained from your Twitter account
      -cs CONSUMER_SECRET, --consumer_secret CONSUMER_SECRET
                            Twitter consumer secret obtained from your Twitter account
      -ak ACCESS_KEY, --access_key ACCESS_KEY
                            Twitter access key obtained from your Twitter account
      -as ACCESS_SECRET, --access_secret ACCESS_SECRET
                            Twitter access secret obtained from your Twitter account
      -o OUTPUT_FILENAME, --output_filename OUTPUT_FILENAME
                            Name of the results output file
      -type DB_TYPE, --db_type DB_TYPE
                            For future use: indicate db type
      -proto PROTOCOL, --protocol PROTOCOL
                            For future use: indicate protocol to connect to the db
      -lang LANGUAGE, --language LANGUAGE
                            For future use: indicate language to query the db
      -sec SECURE, --secure SECURE
                            Flag for secure connection
      -server SERVER_NAME, --server_name SERVER_NAME
                            FQDN of the db server
      -port SERVER_PORT, --server_port SERVER_PORT
                            Server port hosting the db service
      -pwd DB_PASSWORD, --db_password DB_PASSWORD
                            Service password to access the db
      -set RESULT_SET, --result_set RESULT_SET
                            Result set file from the streaming script (Default: Output/search.json)
      -name CONFERENCE_NAME, --conference_name CONFERENCE_NAME
                            Name of the conference for the master node
      -loc CONFERENCE_LOCATION, --conference_location CONFERENCE_LOCATION
                            Location of the conference for the master node
      -time CONFERENCE_TIME_ZONE, --conference_time_zone CONFERENCE_TIME_ZONE
                            Number of (+/-) hours from UTC of the conference's time zone
      -start CONFERENCE_START_DATE, --conference_start_date CONFERENCE_START_DATE
                            First day of the conference in dd/mm/yyyy format
      -end CONFERENCE_END_DATE, --conference_end_date CONFERENCE_END_DATE
                            Last day of the conference in dd/mm/yyyy format
      -purge PURGE_BEFORE_IMPORT, --purge_before_import PURGE_BEFORE_IMPORT
                            Indicate whether the graph must be deleted before importing (Default: false)
      -fname FILTER_ORGANIZER_TWITTER_SCREENAME, --filter_organizer_twitter_screename FILTER_ORGANIZER_TWITTER_SCREENAME
                            Twitter screen name used to filter out organizer tweets and retweets
      -fhash FILTER_CONFERENCE_HASHTAG, --filter_conference_hashtag FILTER_CONFERENCE_HASHTAG
                            Hashtag of the conference
      -b BACKUP_INI_FILE_NAME, --backup_ini_file_name BACKUP_INI_FILE_NAME
                            Name of the Ini file to back up this request's parameters

Ini file example

In the Ini folder you should find a Default.ini file describing the expected format for a global Ini file:

    #Default initialization filter
    #All dates shall be in format dd/mm/yyyy
    [DEFAULT]
    output_filename = Output/search.json
    search = #Identiverse

    [Twitter]
    consumer_key = <your_consumer_key>
    consumer_secret = <your_consumer_secret>
    access_key = <your_access_key>
    access_secret = <your_access_secret>

    [Graph]
    db_type = Neo4j
    protocol = bolt
    language = cypher
    server_name = localhost
    server_port = 7687
    db_password = Identiverse

    [Processing]
    result_set = Output/search.json
    conference_name = Identiverse 2018
    conference_location = Boston
    conference_time_zone = -4
    conference_start_date = 24/06/2018
    conference_end_date = 27/06/2018

    [Misc]
    purge_before_import = false
    filter_organizer_twitter_screename = Identiverse
    filter_conference_hashtag = Identiverse
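The file subcommand reads this layout with configparser. A minimal sketch of that loading step, using a trimmed copy of the example above (the scripts themselves would call `config.read("Ini/Default.ini")`):

```python
import configparser

INI_TEXT = """
[DEFAULT]
output_filename = Output/search.json
search = #Identiverse

[Twitter]
consumer_key = <your_consumer_key>

[Processing]
conference_name = Identiverse 2018
conference_time_zone = -4
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# [DEFAULT] values are inherited by every section
print(config["Twitter"]["search"])                          # #Identiverse
print(config["Processing"]["conference_name"])              # Identiverse 2018
print(config.getint("Processing", "conference_time_zone"))  # -4
```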

General limitations and advice

By using these scripts, you understand that:

  • Having two scripts lets you run the two operations independently
  • The scripts do not check for file existence when writing (results and configuration), so be careful not to overwrite a file you want to keep
  • The public Twitter search API will not return unindexed results, some results older than 7 days, or necessarily all the results you would get through the UI version
  • The stream.py search filter is designed to target a conference hashtag, but it is a standard Twitter search filter supporting all the options Twitter allows
  • graphify.py currently supports only Neo4j, the Bolt protocol, and the Cypher language

Example of the result

If successful, you should be able to use the Neo4j tools to visualize and drill into your tweet graph:
[image: Tweet Graph]

Example of drilling into a Retweet/Quote/Reply:
[image: Drilling]

Here are some interesting Cypher query examples.
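For instance, queries along these lines can surface the most active hashtags and most retweeted tweets. The label and relationship names (`Tweet`, `Hashtag`, `Handle`, `TAGGED`, `RETWEETS`, `POSTED`) are assumptions for illustration; adjust them to whatever the import actually created:

```python
# Hypothetical Cypher queries held as Python strings (names are assumptions):
TOP_HASHTAGS = """
MATCH (t:Tweet)-[:TAGGED]->(h:Hashtag)
RETURN h.text AS hashtag, count(t) AS tweets
ORDER BY tweets DESC LIMIT 10
"""

MOST_RETWEETED = """
MATCH (rt:Tweet)-[:RETWEETS]->(t:Tweet)<-[:POSTED]-(a:Handle)
RETURN a.screen_name AS author, t.id AS tweet, count(rt) AS retweets
ORDER BY retweets DESC LIMIT 10
"""

# With py2neo (already a prerequisite) they would run as, e.g.:
#   graph = Graph("bolt://localhost:7687", password="...")
#   for row in graph.run(TOP_HASHTAGS):
#       print(row["hashtag"], row["tweets"])
```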

Roadmap

  • Generate statistics (1st-level Tweeters, 1st-level Tweets, engaged Twitter users, engaged Tweets, etc.)
  • Follow RT, Reply, Quote up and down a la treeverse <- partially solved, will need the Expand script
  • Script Redox: merge similar RTs into a single RT-Tweet
  • Script Expand: import all retweets via the retweets_of_status_id and replies via the in_reply_to_status_id Premium Search parameters
  • Script Append: continue or update an import with a list of tweets; check first whether a tweet is already imported
  • Switch the script functions to async: https://www.aeracode.org/2018/02/19/python-async-simplified/
  • Think about KPIs: tweet rate, top User/Tweet/Hashtag/Mention (see Generate statistics)
  • WebUI to view the graph online
  • Update console logging to be more dynamic
  • Better date management
  • Change the Post- and Pre-conference period id to something specific to the conference upload to prevent cross-mapping
  • Change the Days-of-conference period id to something specific to the conference upload to prevent cross-mapping
  • Correct the name attribute of the Source object to remove the href