How do we process data in different formats like docx, pdf, etc., and generate insights that can be linked with structured data in a database? This code pattern helps in establishing relations between structured and unstructured data to generate recommendations using Watson NLU and Watson Studio.
In this code pattern, we demonstrate a methodology for integrating structured and unstructured data to generate recommendations. Processing unstructured data that arrives in different formats poses many challenges in extracting the data and deriving meaning from it to support informed decisions, while the related data often sits in structured form. Checking different data sources manually for inference is time consuming, and that is where this pattern comes in handy. We showcase a configurable yet scalable process that merges the different data sources and expedites decision making. We take the example of an HR recruitment process, where a candidate's resume is compared with the job description and the candidate database to identify the best-suited candidate for a given job profile. This helps HR develop an efficient recruitment plan. Our goal is to select the right candidate, which mitigates risk for the organization, enhances ROI, and increases the credibility of the recruitment process. We will be using Watson Studio and Watson NLU to solve this use case.
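At its core, the comparison between a resume and a job description can be sketched as a weighted keyword match: keywords extracted from the job description (for example, by Watson NLU, with their relevance scores as weights) are looked up in the keywords extracted from a resume. The function and data below are a minimal, hypothetical sketch of that idea, not the notebook's actual implementation.

```python
from typing import Dict, List

def match_score(resume_keywords: List[str],
                job_keywords: Dict[str, float]) -> float:
    """Score a resume against weighted job-description keywords.

    job_keywords maps each required keyword to a weight (e.g. a
    relevance score extracted from the job description text).
    Returns the fraction of total weight covered by the resume.
    """
    resume = {k.lower() for k in resume_keywords}
    total = sum(job_keywords.values())
    hit = sum(w for k, w in job_keywords.items() if k.lower() in resume)
    return hit / total if total else 0.0

# Hypothetical keywords for illustration only.
job = {"user experience": 1.0, "wireframing": 0.8, "python": 0.4}
resume = ["User Experience", "Wireframing", "SQL"]
print(round(match_score(resume, job), 2))  # 0.82
```

Candidates can then be ranked by this score before the structured attributes (experience, application history) are brought in.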
When the reader has completed this code pattern, they will understand how to:
The intended audience for this code pattern is developers who want to learn a new method for scanning text across different document formats and establishing a relation with data stored in structured form in a database. The distinguishing factor of this code pattern is that it provides a configurable mechanism of search optimization that allows the recruiter to select the best-fit candidate for the role.
Some of the other use cases where this methodology can be applied are listed below.
Drive optimization in the insurance domain by evaluating the information from the claim forms and new applicant forms.
By linking structured and unstructured data, we can get recommendations for new leads, process improvements, and more. The structured data takes the form of product-offering details and the customer database, whereas the claim forms and new-applicant forms hold the unstructured data. How we map them to extract insights, get recommendations, enhance efficiency, increase ROI, and generate revenue are some of the highlights.
Enhance the efficiency of spare parts in the automobile industry to reduce warranty claims by taking the customer feedback after the service.
Customer feedback is captured in an unstructured form; keywords are extracted from the problem statement and compared with the structured information in the inventory system to check quality and durability. If there are repeated complaints about specific SKUs, the manufacturing cycle of those SKUs should be reviewed for improvements, which can enhance the durability of the spare parts and reduce warranty claims. This also helps the R&D team come up with new features for existing components to deliver superior performance, enhancing the customer experience and resulting in increased sales.
Other benefits include identifying the key parts causing the failure, collecting detailed failure descriptions, consolidating warranty systems and processes, and minimizing the number of automobile spare-part returns, resulting in good inventory management.
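The "repeated complaints about specific SKUs" check described above can be sketched as a simple count over (SKU, complaint-keyword) pairs. In the pattern, the keyword would come from NLU analysis of the free-text feedback form; the records and threshold below are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical feedback records: (sku, extracted complaint keyword).
feedback: List[Tuple[str, str]] = [
    ("SKU-101", "brake noise"),
    ("SKU-101", "brake noise"),
    ("SKU-205", "oil leak"),
    ("SKU-101", "brake noise"),
]

def flag_repeated_complaints(records, threshold=3):
    """Return (sku, keyword) pairs whose complaint count reaches the threshold."""
    counts = Counter(records)
    return [pair for pair, n in counts.items() if n >= threshold]

print(flag_repeated_complaints(feedback))  # [('SKU-101', 'brake noise')]
```

Flagged SKUs are the ones whose manufacturing cycle would be reviewed.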
IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market.
Watson Natural Language Understanding: An IBM Cloud service that can analyze text to extract metadata from content such as concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles, using natural language understanding.
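For orientation, the NLU `analyze` call returns JSON whose `keywords` feature is a list of `{"text": ..., "relevance": ...}` entries. Below is a minimal sketch of pulling keywords out of such a response; the sample values are made up, and only the response-parsing step is shown (the actual service call requires credentials and the `ibm-watson` SDK).

```python
# Abridged, illustrative shape of an NLU analyze response with the
# `keywords` feature enabled; the texts and scores here are invented.
sample_response = {
    "keywords": [
        {"text": "machine learning", "relevance": 0.94},
        {"text": "user experience", "relevance": 0.71},
    ]
}

def top_keywords(response, min_relevance=0.5):
    """Pull keyword strings above a relevance cutoff from an NLU response."""
    return [k["text"] for k in response.get("keywords", [])
            if k.get("relevance", 0) >= min_relevance]

print(top_keywords(sample_response))  # ['machine learning', 'user experience']
```

These extracted keywords are what get matched against the structured candidate data later in the notebook.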
Follow these steps to set up and run this code pattern. The steps are described in detail below.
Sign up for IBM Cloud. By clicking on Create a free account, you will get a 30-day trial account.
Sign up for IBM’s Watson Studio.
Click on New project and select Data Science as shown below.
Define the project by giving a Name and hit ‘Create’.
By creating a project in Watson Studio, a free-tier Object Storage service will be created in your IBM Cloud account. Choose the storage type as Cloud Object Storage for this code pattern.
Create the following IBM Cloud service and name it `wdc-NLU-service`:
Create the notebook:

* In your project, go to the `Assets` tab and select the `Create notebook` option to create a notebook.
* Select the `From URL` tab.
* Click the `Create` button.

Add the data:

* Clone this repo.
* Navigate to the `data` directory.
* Use `Find and Add Data` (look for the `10/01` icon) and its `Files` tab. From there you can click `browse` and add data files from your computer. Insert the three files as specified in the notebook.

Note: The data files are in the `data` directory.
* Select the cell below the `Add your service credentials from IBM Cloud for the Watson services` section in the notebook to update the credentials for Watson Natural Language Understanding.
* Open the Watson Natural Language Understanding service in your IBM Cloud Dashboard and click on your service, which you should have named `wdc-NLU-service`.
* Once the service is open, click the `Service Credentials` menu on the left.
* In the `Service Credentials` list that opens up in the UI, select whichever `Credentials` you would like to use in the notebook from the `KEY NAME` column. Click `View credentials` and copy the `username` and `password` key values that appear on the UI in JSON format.
* Update the `username` and `password` key values in the cell below the `Add your service credentials from IBM Cloud for the Watson services` section.
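After pasting, the credentials cell ends up looking roughly like the following. The variable name and placeholder values are illustrative, not taken from the notebook; note that newer NLU instances issue an `apikey` instead of a `username`/`password` pair, in which case the shape differs.

```python
# Illustrative placeholders - paste the values copied from the
# wdc-NLU-service credentials in your IBM Cloud dashboard.
nlu_credentials = {
    "username": "<your-nlu-username>",
    "password": "<your-nlu-password>",
}
```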
* Select the cell below the `Add your service credentials for Object Storage` section in the notebook to update the credentials for Object Storage.
* Delete the contents of the cell.
* Use `Find and Add Data` (look for the `10/01` icon) and its `Files` tab. You should see the file names uploaded earlier. Make sure your active cell is the empty one below `Add...`
* Select `Insert to code` (below your sample_config.txt file).
* Click `Insert Credentials` from the drop-down menu.
* Make sure the credentials are saved as `credentials_1`.
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.

IMPORTANT: The first time you run your notebook, you will need to install the necessary packages as mentioned in the notebook and then `Restart the kernel`.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is `In [x]:`. Depending on the state of the notebook, the `x` can be:

* A blank, this indicates that the cell has never been executed.
* A number, this number represents the relative order this code step was executed.
* A `*`, this indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:
* One cell at a time. Select the cell, and then press the `Play` button in the toolbar.
* Batch mode, in sequential order. From the `Cell` menu bar, there are several options available. For example, you can `Run All` cells in your notebook, or you can `Run All Below`, which will start executing from the currently selected cell and then continue executing all cells that follow.
* At a scheduled time. Press the `Schedule` button located in the top right section of your notebook.

We can evaluate the output for different requirements, which helps in taking informed decisions. For example, there is a requirement for a user experience designer with 48 months of experience, and the query passed to our system returns two candidate profiles matching the criteria. However, we can see that candidate 1 has applied before and did not accept the offer, whereas candidate 10 has not applied before and has a greater chance of accepting the offer if selected. HR can shortlist candidate 10 by taking an informed decision as per the recommendation of our system, which has internally analysed the CV and the candidate database.
The second example is a query for a candidate with good machine learning expertise, where two candidates, candidate 11 and candidate 14, fulfill the requirement. Our system recommends candidate 11 because that candidate has a master's degree in statistics and will be a better fit for the role, even though candidate 14 has more experience. We add inference by reviewing aspects of the CV to identify the best-fit candidate for each role.
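The first query above (role plus minimum experience, preferring candidates with no history of declining an offer) can be sketched as a filter-and-sort over candidate records. The field names and values below are hypothetical, chosen only to mirror the candidate 1 vs. candidate 10 example.

```python
# Hypothetical candidate records; in the pattern these fields come from
# the resume analysis and the candidate database.
candidates = [
    {"id": 1,  "role": "user experience designer",
     "months_experience": 52, "applied_before": True},
    {"id": 10, "role": "user experience designer",
     "months_experience": 50, "applied_before": False},
]

def recommend(cands, role, min_months):
    """Shortlist candidates matching the role/experience query,
    ranking those who have not applied (and declined) before first."""
    matches = [c for c in cands
               if c["role"] == role and c["months_experience"] >= min_months]
    # False sorts before True, so fresh applicants come first.
    return sorted(matches, key=lambda c: c["applied_before"])

best = recommend(candidates, "user experience designer", 48)[0]
print(best["id"])  # 10
```

A richer version would break ties on further CV-derived signals, such as the degree-based preference in the second example.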
To sum up, we have demonstrated a methodology using Watson Natural Language Understanding and Watson Studio to analyze structured and unstructured data and generate recommendations that can be used across different domains for multiple use cases.
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer [Certificate of Origin, Version 1.1 (DCO)](https://developercertificate.org/) and the [Apache Software License, Version 2](http://www.apache.org/licenses/LICENSE-2.0.txt).