Get insights from OrientDB database using PyOrient through IBM Watson Studio
Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.
This journey gives you a head start on working with graphs in OrientDB through IBM Watson Studio using the PyOrient module, a Python driver for OrientDB, to operate on data and get insights from OrientDB. IBM Watson Studio can be used to analyze data using Jupyter notebooks.
OrientDB is a multi-model database, supporting graph, document, key/value, and object models, but the relationships are managed as in graph databases, with direct connections between records. Graph databases are well suited for analyzing interconnections, for example when mining data from social media. They are also useful for working with data in business disciplines that involve complex relationships and dynamic schemas, and for creating recommendations like “customers who bought this also looked at…”. This journey will help you understand the end-to-end flow: downloading the data set, cleansing the data, extracting entities and relations from the data set, connecting to OrientDB, creating a new OrientDB database, populating the database with node classes, edge classes, vertices, and relations, and then executing queries to get insights from the data in the OrientDB database. OrientDB has extended SQL to support graph traversal, making it easy for developers familiar with SQL to start exploring graph databases for their business needs.
In this journey we will demonstrate:
To achieve this, an OrientDB instance is created on a Kubernetes cluster and then accessed through IBM Watson Studio. This journey will help developers get started with various OrientDB operations like CRUD, basic traversal, and extracting insights using PyOrient on IBM Watson Studio.
When the reader has completed this journey, they will understand how to:
OrientDB: A Multi-Model Open Source NoSQL DBMS.
IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market.
Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
Kubernetes Clusters: An open-source system for automating deployment, scaling, and management of containerized applications.
Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
Graph Database: A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with one operation.
Create a Kubernetes cluster with IBM Cloud Container Service to deploy in the cloud. Deploy OrientDB on the Kubernetes cluster by following Deploy OrientDB on Kubernetes.
Watch this video to get an overview of this developer Journey.
Follow these steps to set up and run this developer journey. The steps are
described in detail below.
Deploy OrientDB on the Kubernetes cluster by following Deploy OrientDB on Kubernetes. This exposes the ports on IBM Cloud through which OrientDB can be accessed from the Jupyter notebook on IBM Watson Studio. Use the IP address of your cluster and the node port mapped to port 2424, on which the OrientDB console is exposed, to access OrientDB from the Jupyter notebook.
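As a minimal sketch of what that connection could look like from a notebook cell, the example below reads the cluster address from a small JSON document and opens a PyOrient session. It assumes the PyOrient driver is installed; the key names (ip, port) and the helper function names are illustrative assumptions, not the actual layout of this pattern's config.json.

```python
import json


def parse_connection_settings(raw_json):
    """Return (host, port) from a JSON string; the key names are assumptions."""
    settings = json.loads(raw_json)
    return settings["ip"], int(settings["port"])


def connect_to_orientdb(host, port, user, password):
    """Open a server-level PyOrient session.

    pyorient is imported here, not at module level, so the parsing helper
    above still works in an environment without the driver.
    """
    import pyorient

    client = pyorient.OrientDB(host, port)
    client.connect(user, password)
    return client


# Placeholder values only; substitute your cluster IP and node port.
host, port = parse_connection_settings('{"ip": "203.0.113.10", "port": "30424"}')
print(host, port)
```

With a reachable server, connect_to_orientdb(host, port, user, password) would return a client on which PyOrient methods such as db_exists, db_create, and db_open can then be called.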
Sign up for IBM’s Watson Studio. When you create a project in Watson Studio, a free-tier Object Storage service is created in your IBM Cloud account.
- Click Create notebook to create a notebook.
- In the Assets tab, select the Create notebook option.
- Select the From URL tab.
- Click the Create button.
- Before uploading the config.json configuration file to Object Storage, make sure you update the config file with the IP address and port noted in 1. Deploy OrientDB on Kubernetes Cluster.
- Upload Graphdb-Insights.csv:
- Click Find and Add Data (look for the 10/01 icon).
- Select the Files tab.
- Click browse and navigate to Graphdb-Insights.csv on your computer.
- Add the config.json Watson Studio configuration file to Object Storage from URL:
- Click Find and Add Data (look for the 10/01 icon) and its Files tab. You should see the file names uploaded earlier. Make sure your active cell is the empty one created earlier.
- Select Insert to code below config.json and click Insert Credentials from the dropdown. Rename the variable to credentials_1 if the name is different.
- Use the 3. Add your service credentials for Object Storage section in the notebook to update the credentials for Object Storage.
- Select Insert to code below Graphdb-Insights.csv (movie dataset) and click Insert Pandas DataFrame from the dropdown in the empty cell below 4.2. Loading the IMDb movie data.
The notebook is divided into sections, each performing a specific task on OrientDB.
- Setup: deals with the installation of OrientDB, importing the packages and libraries, adding the credentials of the files from Object Storage, and loading them in the notebook for use.
- Utility Functions and Core functions: The notebook creates a graph with two node classes, person and movie. The person class has the attributes name, fblikes, and role (actor/director). The movie class has the attributes title, year, durationInMins, imdbRating, genre, plotKeywords, numCriticForReviews, and movieFacebookLikes. Two types of relationships connect the nodes: worked_with, between two person nodes who have worked together in the same movie, and acted_in, between a person node and a movie node for a person who has acted in that movie. The utility functions check for duplicates, because IF NOT EXISTS is only valid for creating properties in OrientDB. Unlike in SQL, IF NOT EXISTS doesn’t work with create class or insert statements in OrientDB. The core functions create the database, create the graph as discussed, and get insights from the graph created.
- Insights and Visualization: focuses on performing various operations on the OrientDB database and getting insights from it.
When a notebook is executed, each code cell in the notebook is executed, in order, from top to bottom.
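The graph schema and the duplicate check described in the Utility Functions and Core functions section can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names, the property types, the string-escaping rule, and the run callable (which stands in for something like a PyOrient client's command method) are all hypothetical. Only the person properties are shown, for brevity.

```python
# Property types are assumptions; the text above names the attributes only.
PERSON_PROPERTIES = {"name": "STRING", "fblikes": "INTEGER", "role": "STRING"}
GRAPH_CLASSES = (("person", "V"), ("movie", "V"),
                 ("acted_in", "E"), ("worked_with", "E"))


def quote(value):
    """Wrap a value in single quotes, backslash-escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def schema_commands(existing_classes):
    """Build the SQL for missing classes plus the person properties.

    Classes are checked in Python; IF NOT EXISTS is used for properties only.
    """
    commands = [f"CREATE CLASS {name} EXTENDS {parent}"
                for name, parent in GRAPH_CLASSES
                if name not in existing_classes]
    commands += [f"CREATE PROPERTY person.{prop} IF NOT EXISTS {kind}"
                 for prop, kind in PERSON_PROPERTIES.items()]
    return commands


def create_person_once(run, name, role):
    """Create a person vertex only if no vertex with that name exists.

    run is any callable that executes one SQL string and returns rows.
    """
    if run(f"SELECT FROM person WHERE name = {quote(name)}"):
        return False
    run(f"CREATE VERTEX person SET name = {quote(name)}, role = {quote(role)}")
    return True


class FakeDatabase:
    """In-memory stand-in so the sketch runs without an OrientDB server."""

    def __init__(self):
        self.created = []

    def run(self, sql):
        if sql.startswith("SELECT"):
            wanted = sql.split("name = ", 1)[1]
            return [c for c in self.created if f"name = {wanted}," in c]
        self.created.append(sql)
        return []


db = FakeDatabase()
print(create_person_once(db.run, "Tom Hanks", "actor"))  # first insert
print(create_person_once(db.run, "Tom Hanks", "actor"))  # duplicate is skipped
print(schema_commands({"person"})[0])
```

Passing the executor in as a parameter keeps the duplicate check independent of the driver, so the same logic can be exercised against the fake above or against a real client.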
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:
- *, which indicates that the cell is currently executing.
There are several ways to execute the code cells in your notebook:
- Press the Play button in the toolbar.
- From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below.
- Press the Schedule button located in the top right section of your notebook.
For this notebook, running the cells one by one is recommended, so that you can follow the flow of the notebook and better understand the operation each cell performs on OrientDB.
The notebook uses two use cases to demonstrate how to get insights from OrientDB: the most mentioned movie and the clustering of movies with an IMDb rating greater than 7. Each insight has its own function in the notebook; you will find these functions in the Core Functions cell. Call those functions to get the results. The following image shows the functions and their results.
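As a rough idea of what such an insight function might issue, the sketch below builds two OrientDB SQL strings: one selecting movies with an IMDb rating above a threshold, and one expanding the worked_with edges of a named person. The function names and the exact queries are assumptions for illustration; the notebook's own Core Functions may differ, and the first query only filters by rating rather than performing any clustering.

```python
def movies_rated_above(min_rating):
    """SQL selecting title and rating of movies above min_rating, best first."""
    threshold = float(min_rating)  # also rejects non-numeric input
    return ("SELECT title, imdbRating FROM movie "
            f"WHERE imdbRating > {threshold} ORDER BY imdbRating DESC")


def coworkers_of(person_name):
    """SQL expanding the vertices linked to a person by worked_with edges."""
    # Assumed escaping: backslash-escape backslashes and single quotes.
    safe_name = person_name.replace("\\", "\\\\").replace("'", "\\'")
    return ("SELECT expand(both('worked_with')) FROM person "
            f"WHERE name = '{safe_name}'")


print(movies_rated_above(7))
print(coworkers_of("Tom Hanks"))
```

The resulting strings could be passed to a PyOrient client's query or command method, or pasted into OrientDB Studio.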
OrientDB also provides an interactive dashboard, OrientDB Studio, for visualizing the graph and viewing query results. You can run queries in the Browse section of OrientDB Studio to get the desired insights or to create nodes and edges. The same two queries that the notebook uses, that is, the most mentioned movie and the clustering of movies with an IMDb rating greater than 7, can be executed in the Browse section of OrientDB Studio to analyze the results; see the screenshots of OrientDB Studio below. The query results are available as a table and as JSON, and can also be downloaded as CSV for further analysis.
- Cluster the movies with IMDb rating greater than 7 and view the results in table format.
- Find the most_mentioned movie and view the results in table format.
- Find most_mentioned and view the results in JSON format.
The graph created by the functions written in the notebook can also be visualized, for example to find the coworkers of the actor Tom Hanks.
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.