Project author: InzamamAnwar

Project description:
Install and integrate Cassandra with Spark on Hortonworks HDP 3.0.1
Project address: git://github.com/InzamamAnwar/cassandra-spark-hdp3.git
Created: 2020-04-08T19:53:05Z
Project community: https://github.com/InzamamAnwar/cassandra-spark-hdp3


cassandra-spark-hdp3

This repo will guide you through installing and integrating Cassandra
with Spark2 on Hortonworks HDP 3.0.1. Skip installing HDP 3.0.1 if you already have it set up.

Install Hortonworks HDP 3.0.1

The Hortonworks Data Platform (HDP) sandbox is a virtual machine (VM)
which can be used to experiment with Apache Hadoop. Other
popular technologies like Hive, Pig, HBase, etc. are
also pre-installed. Follow the steps below to install HDP 3.0.1,
which is the latest version provided.

  1. Head to Cloudera
    and choose the installation type VirtualBox. Then submit
    the required information and download HDP 3.0.1.
    Be aware that the download will take a while, since the file size is about 20 GB.

  2. Download the latest VirtualBox
    for a Windows host.

  3. Follow the installer steps and install VirtualBox.

  4. Open VirtualBox, (a) click File, (b) select Import Appliance,
    (c) select the HDP file downloaded earlier,
    (d) click Next and (e) click Import. Do not change
    any configuration for this VM (4 vCPUs and 10 GB of base memory).
    The import process will take a few minutes to complete.

  5. Start the newly created VM from VirtualBox. Open your
    browser and go to 127.0.0.1:8080. If everything is working,
    then you will see the Ambari homepage. Log in using user: raj_ops
    and password: raj_ops. Check whether the Spark2 service is running.
    If it is not, click on Services and restart the essential services.
    A command-line check is sketched just below.
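
If you would rather confirm from the command line that Ambari is up and that Spark2 is started, something along these lines should work. This is only a sketch; the cluster name Sandbox is an assumption for the HDP sandbox image, so adjust it if yours differs.

```shell script
# Should print 200 (or a redirect code) once the Ambari server inside the VM is up;
# port 8080 is forwarded to the host by the sandbox appliance.
curl -s -o /dev/null -w "Ambari HTTP status: %{http_code}\n" http://127.0.0.1:8080

# Query the state of the Spark2 service through the Ambari REST API
# (assumes the cluster is named "Sandbox"; expect "STARTED").
curl -s -u raj_ops:raj_ops \
  "http://127.0.0.1:8080/api/v1/clusters/Sandbox/services/SPARK2?fields=ServiceInfo/state"
```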

Install Cassandra

We will be installing Cassandra version 3.0.20. Running Cassandra will
take some extra resources, so either add more base memory to the VM or
stop some services. If extra RAM is available, increase the memory in the
VirtualBox settings. The other option is to stop services
which are not necessary for this application. Follow the steps
below to stop the extra services.

  1. Start the VM and log in to Ambari using user: admin and
    password: admin. Select Hive, then Actions and Stop.
    Do the same for Ranger and Data Analytics Studio.
    (A command-line alternative is sketched just below.)
  2. ssh into the VM using ssh root@127.0.0.1 -p 2222
    and password hadoop. If this is the first login, you will be
    prompted to change the password. Create the Cassandra rpm
    repo by creating a new file named cassandra.repo in
    /etc/yum.repos.d with the following content:

    ```
    [cassandra]
    name=Apache Cassandra
    baseurl=https://downloads.apache.org/cassandra/redhat/30x/
    gpgcheck=1
    repo_gpgcheck=1
    gpgkey=https://downloads.apache.org/cassandra/KEYS
    ```

    Cassandra requires Python 2, so check the Python version by running python --version. Run sudo yum install cassandra
    and make sure that the package being downloaded is Cassandra 3.0.20.
    Start Cassandra with sudo service cassandra start.
    A quick check that the service is up follows below.
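
    Before moving on, it is worth confirming that Cassandra actually came up. A
    quick check could look like this:

    ```shell script
    # Confirm the installed package version (should report 3.0.20).
    yum info cassandra | grep -i version

    # Check the service and verify that the single node has joined the ring
    # ("UN" = Up/Normal in the first column of nodetool's output).
    sudo service cassandra status
    nodetool status
    ```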

  3. cqlsh is a SQL-like shell for working with Cassandra.
    Run the cqlsh command to enter the shell and execute the following code.

    ```sql
    CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
    CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    INSERT INTO test.kv(key, value) VALUES ('key2', 2);
    ```

    This will create a table kv in keyspace test and
    insert two entries into it. We will retrieve this table
    from Spark2.
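
The inserted rows can also be verified without staying in the interactive shell, since cqlsh accepts a statement via -e:

```shell script
# Should print the two rows inserted above (key1/1 and key2/2).
cqlsh -e "SELECT * FROM test.kv;"
```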

Integrate Cassandra and Spark2

Log in to HDP over ssh with ssh maria_dev@127.0.0.1 -p 2222
and password: maria_dev. Enter the following command to run
spark-shell with the spark-cassandra-connector.
```shell script
spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.1,commons-configuration:commons-configuration:1.6
```
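
The connector artifact above targets Scala 2.11 and the Spark 2.3 line, which is what HDP 3.0.1 ships; if your sandbox reports something different, pick the matching spark-cassandra-connector release. The bundled versions can be checked with:

```shell script
# Prints the Spark and Scala versions bundled with the sandbox; they should match
# the _2.11 / 2.3.x connector coordinates used above.
spark-submit --version
```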

Now we will retrieve the dummy values we inserted earlier using `cqlsh`. In the Scala shell, run:

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
```


Retrieve data in PySpark

Create a new Python script, CassandraSpark2.py:

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CassandraApp").getOrCreate()
    # read data from cassandra
    readData = spark.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="test").load()
    # print table
    readData.show()
    # insert data
    tmp_df = spark.createDataFrame([("key3", 3), ("key4", 4)], ["key", "value"])
    tmp_df.write.format("org.apache.spark.sql.cassandra")\
        .mode("append").options(table="kv", keyspace="test").save()
    # read new data
    readData = spark.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="test").load()
    readData.show()
    # end spark session
    spark.stop()
```

Run the above script using the following command:

```shell script
spark-submit --conf spark.cassandra.connection.host=127.0.0.1 \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.1,commons-configuration:commons-configuration:1.6 \
CassandraSpark2.py
```
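
Once the job finishes, the rows written from PySpark can be checked back in Cassandra, for example:

```shell script
# key3 and key4 should now appear alongside the rows inserted earlier from cqlsh.
cqlsh -e "SELECT * FROM test.kv;"
```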
