Project author: kaantas

Project description:
Counting Tweets Per User in Real-Time
Language: Python
Repository: git://github.com/kaantas/kafka-twitter-spark-streaming.git
Created: 2017-07-18T13:01:33Z
Project page: https://github.com/kaantas/kafka-twitter-spark-streaming



Twitter and Spark Streaming with Apache Kafka

This project counts, per user and in real time, the tweets that include the #GoTS7 hashtag.

Each username is printed together with its tweet count.

Code Explanation

  1. Authentication is handled with Python's Tweepy module.
  2. A StreamListener named KafkaPushListener was created for Twitter streaming. It produces the data that the Kafka consumer reads.
  3. The produced data is filtered to tweets that include the Game of Thrones hashtag.
  4. A SparkContext was created to connect to the Spark cluster.
  5. A Kafka consumer that reads from the ‘twitter’ topic was created.
  6. The number of tweets that include the #GoTS7 hashtag is calculated per user, and the usernames and counts are printed in real time.
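
The counting in step 6 can be illustrated without a running cluster. The following is a minimal plain-Python sketch, not the project's Spark code: it assumes each Kafka message is a tweet serialized as JSON with a `text` field and a `user.screen_name` field, and the function name `count_tweets_per_user` and the sample batch are hypothetical. It applies the same filter-then-count-by-key logic to a single batch of messages.

```python
import json
from collections import Counter

HASHTAG = "#GoTS7"  # hashtag named in this README; matched case-insensitively


def count_tweets_per_user(raw_messages, hashtag=HASHTAG):
    """Count tweets containing `hashtag` per user for one batch of JSON messages.

    Assumes each message is a JSON tweet with "text" and "user.screen_name".
    """
    needle = hashtag.lower()
    counts = Counter()
    for raw in raw_messages:
        tweet = json.loads(raw)
        text = tweet.get("text", "")
        # Keep only tweets that mention the hashtag, then count by username.
        if needle in text.lower():
            counts[tweet["user"]["screen_name"]] += 1
    return counts


# Hypothetical sample batch standing in for messages read from the topic.
batch = [
    json.dumps({"text": "Winter is here #GoTS7", "user": {"screen_name": "alice"}}),
    json.dumps({"text": "Rewatching tonight #gots7", "user": {"screen_name": "alice"}}),
    json.dumps({"text": "No hashtag here", "user": {"screen_name": "bob"}}),
    json.dumps({"text": "#GoTS7 finale!", "user": {"screen_name": "carol"}}),
]

for user, n in count_tweets_per_user(batch).most_common():
    print(user, n)
```

In a streaming job the same logic would run once per micro-batch, with the per-user counts printed after each batch.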

Running

  1. Create a Twitter API account and get the keys for twitter_config.py.
  2. Start Apache Kafka.
    1. ./kafka/kafka_2.11-0.11.0.0/bin/kafka-server-start.sh ./kafka/kafka_2.11-0.11.0.0/config/server.properties
  3. Run kafka_push_listener.py with Python 3.
    1. PYSPARK_PYTHON=python3 bin/spark-submit kafka_push_listener.py
  4. Run kafka_twitter_spark_streaming.py with Python 3.
    1. PYSPARK_PYTHON=python3 bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 kafka_twitter_spark_streaming.py
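
The contents of twitter_config.py are not shown in this README. As a rough sketch only, a file like it might hold the four credentials that Twitter's OAuth 1.0a flow uses. The variable names and environment variable names below are hypothetical, and the project's actual file may be laid out differently. Reading the values from the environment keeps the keys out of version control.

```python
import os

# Hypothetical names: the real twitter_config.py may use different ones.
# Each value falls back to an empty string when the variable is not set.
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN", "")
access_secret = os.environ.get("TWITTER_ACCESS_SECRET", "")
```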