Extract Textual Insights from Video
This code pattern is part of the series Extracting Textual Insights from Videos with IBM Watson. Please complete the Extract audio from video, Build custom Speech to Text model with speaker diarization capabilities, and Use advanced NLP and tone analysis to extract meaningful insights code patterns of the series before continuing, since all of the code patterns are linked.
In a virtually connected world, staying focused on work or education is very important. Studies suggest that most people tend to lose focus in live virtual meetings or virtual classroom sessions after about 20 minutes, so most meetings and virtual classrooms are recorded so that individuals can go through them later.
What if these recordings could be analyzed with the help of AI, and a detailed report of the meeting or classroom could be generated? Toward this goal, in this code pattern, given a video recording of a virtual meeting or classroom, we will extract the audio from the video file using the open source library FFmpeg, transcribe the audio to get speaker-diarized notes with custom-trained language and acoustic Speech to Text models, and generate an NLU report consisting of Category, Concepts, Emotion, Entities, Keywords, Sentiment, Top Positive Sentences, and Word Clouds, all in a Python Flask runtime.
In short, given any video, you will learn how to extract speaker-diarized notes and a meaningful insights report using Speech to Text, advanced NLP, and tone analysis.
When you have completed this code pattern, you will understand how to extract audio from video files, transcribe audio into speaker-diarized notes, and generate an NLU insights report. The flow of the application is:
1. The user uploads a recorded video file of the virtual meeting or virtual classroom to the application.
2. The FFmpeg library extracts audio from the video file (a minimal sketch of this step follows the list).
3. Watson Speech to Text transcribes the audio to give a diarized textual output.
4. Watson Language Translator optionally translates other languages into an English transcript.
5. Watson Tone Analyzer analyzes the transcript and picks out the top positive statements from the transcript.
6. Watson Natural Language Understanding reads the transcript to identify key pointers and get the sentiments and emotions.
7. The key pointers and a summary of the video are then presented to the user in the application.
8. The user can then download the textual insights.
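As a minimal sketch of step 2, audio can be pulled out of a video by invoking FFmpeg from Python; the file names here are illustrative, and the pattern's own code may organize this differently:

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract the audio track from a video file using FFmpeg.

    Assumes the `ffmpeg` binary is installed and on the PATH.
    """
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,         # input video file
            "-vn",                    # drop the video stream
            "-acodec", "libmp3lame",  # encode the audio as MP3
            "-y",                     # overwrite output if it exists
            audio_path,
        ],
        check=True,
    )

extract_audio("meeting.mp4", "meeting.mp3")
```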
Clone the extract-textual-insights-from-video repo locally. In a terminal, run:

```bash
$ git clone https://github.com/IBM/extract-textual-insights-from-video
```
You will have to add Watson Speech to Text, Tone Analyzer, and Natural Language Understanding credentials to the application.
If you have completed the first three code patterns of the series, you can reuse the credentials created in the second and third code patterns by following the steps below. If you have landed on this code pattern directly without completing the previous code patterns, you can create new credentials as follows:
- Click on Create as shown.
- Click on Service Credentials, then New credential, and add a service credential as shown.
- Under Select a pricing plan, select Lite and click on Create as shown.
- Repeat these steps for each of the three services: Speech to Text, Tone Analyzer, and Natural Language Understanding.
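For reference, here is a minimal sketch of how such credentials are typically wired up with the ibm-watson Python SDK; the API keys, service URLs, and version dates below are placeholders, and the pattern's own code may organize this differently:

```python
from ibm_watson import (NaturalLanguageUnderstandingV1, SpeechToTextV1,
                        ToneAnalyzerV3)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Paste the `apikey` and `url` values from each service's
# credentials page in place of these placeholders.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_STT_APIKEY"))
stt.set_service_url("YOUR_STT_URL")

tone = ToneAnalyzerV3(version="2017-09-21",
                      authenticator=IAMAuthenticator("YOUR_TONE_APIKEY"))
tone.set_service_url("YOUR_TONE_URL")

nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_NLU_APIKEY"))
nlu.set_service_url("YOUR_NLU_URL")
```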
To run the application with Docker, build the image and start the container:

```bash
$ cd extract-textual-insights-from-video/
$ docker image build -t extract-textual-insights-from-video .
$ docker run -p 8080:8080 extract-textual-insights-from-video
```

To run locally without Docker, first install FFmpeg (for example, with Homebrew on macOS):

```bash
$ brew install ffmpeg
```

Then change into the repo directory, use pip to install the Python libraries, and start the application:

```bash
$ cd extract-textual-insights-from-video/
$ pip install -r requirements.txt
$ python app.py
```

Either way, the application will be served on port 8080.
We’ll begin by uploading a video from which we’ll extract insights.
You can use any meeting or classroom video that you have, or you can download the video that we used for demonstration purposes.
This is a free educational video taken from cognitiveclass.ai; the video is an introduction to a Python course.
Click on Drag and drop files here or click here to upload, and choose the video file you want to extract insights from.
Note: We have trained a custom language model and an acoustic model with the IBM Earnings Call Q1 2019 dataset, so the models will perform best on computer science and finance related content. The models can be trained on whatever content you wish to extract; for example, train them with a sports dataset to get the best results on sports commentary.
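For illustration, a custom language model can be created and trained through the Speech to Text SDK roughly as follows; the model name, corpus name, and file name are hypothetical, and the credentials are placeholders as in the earlier sketch:

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials; see the credentials sketch above.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_STT_APIKEY"))
stt.set_service_url("YOUR_STT_URL")

# Create a custom model on top of a base model.
lm = stt.create_language_model(
    "my-domain-model", "en-US_BroadbandModel").get_result()
custom_id = lm["customization_id"]

# Add a plain-text corpus of domain sentences, then train.
with open("domain-corpus.txt", "rb") as corpus_file:
    stt.add_corpus(custom_id, "domain-corpus", corpus_file)

# The corpus must finish processing before training; poll
# stt.get_language_model(custom_id) until its status is 'ready'.
stt.train_language_model(custom_id)
```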
Click on the Submit button and wait for the application to process. After you press Submit, the application will, in the background, extract the audio, transcribe it, and run the tone and NLU analysis described in the flow above. You can track the progress through the progress bar as shown; the various processing stages correspond to those steps.
NOTE: An approximate time to complete the extraction of insights will be displayed.
Once processing completes, click on the Speech To Text tab to view the speaker-diarized transcript as shown. Click on the NLU Analysis tab to view the report.
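Under the hood, the diarized transcript comes from the Speech to Text speaker_labels option. A minimal sketch, assuming the `stt` client and `custom_id` from the sketches above and an illustrative file name:

```python
# Sketch: transcribe with speaker labels (diarization) enabled.
with open("meeting.mp3", "rb") as audio_file:
    result = stt.recognize(
        audio=audio_file,
        content_type="audio/mp3",
        speaker_labels=True,                  # per-word speaker ids
        language_customization_id=custom_id,  # optional custom model
    ).get_result()

# speaker_labels pairs time ranges with speaker ids, which can be
# joined against the word timestamps to build diarized notes.
for label in result.get("speaker_labels", []):
    print(label["speaker"], label["from"], label["to"])
```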
More about the features:

- Category: Categorize your content using a five-level classification hierarchy. View the complete list of categories here.
- Concept Tags: Identify high-level concepts that aren’t necessarily directly referenced in the text.
- Entity: Find people, places, events, and other types of entities mentioned in your content. View the complete list of entity types and subtypes here.
- Keywords: Search your content for relevant keywords.
- Sentiments: Analyze the sentiment toward specific target phrases and the sentiment of the document as a whole.
- Emotions: Analyze the emotion conveyed by specific target phrases or by the document as a whole.
- Positive Sentences: The Watson Tone Analyzer service uses linguistic analysis to detect emotional and language tones in written text.

Learn more about the features of:

- The Watson Natural Language Understanding service. Learn more.
- The Watson Tone Analyzer service. Learn more.
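As a rough sketch of how these features map to API calls, reusing the `nlu` and `tone` clients from the credentials sketch, with `transcript` standing in for the transcribed text:

```python
from ibm_watson.natural_language_understanding_v1 import (
    CategoriesOptions, ConceptsOptions, EmotionOptions, EntitiesOptions,
    Features, KeywordsOptions, SentimentOptions)

# Sketch: request all of the features listed above in one call.
nlu_report = nlu.analyze(
    text=transcript,
    features=Features(
        categories=CategoriesOptions(limit=3),
        concepts=ConceptsOptions(limit=3),
        entities=EntitiesOptions(sentiment=True),
        keywords=KeywordsOptions(sentiment=True, emotion=True),
        sentiment=SentimentOptions(),
        emotion=EmotionOptions(),
    ),
).get_result()

# Sentence-level tones from Tone Analyzer; positive sentences can
# be selected by filtering for tones such as 'joy' or 'confident'.
tones = tone.tone({"text": transcript},
                  content_type="application/json").get_result()
```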
Once the NLU Analysis report is generated, you can review it. The report consists of:

- Features extracted by Watson Natural Language Understanding
- Other features
- Category: Based on the dataset we used, you can see that the category was extracted as technology and computing, specifically Software. Note: you can see the model’s confidence score in the green bubble tags.
- Entity: As you can see, the entity is Person, specifically Alex Ackles, indicating that in the video recording most of the emphasis is given to a person, Ackles.
- Concept Tags: The top three concept tags extracted from the video are United Nations, Aesthetics, and Statistics, indicating that the speaker spoke about these contexts most often.
- Keywords, Sentiments and Emotions: The top keywords are extracted along with their sentiments and emotions, giving a sentiment analysis of the entire meeting.
- Top Positive Sentences: Based on emotional tone and language tone, the positive sentences spoken in the video are extracted, limited to the top five.
- Word Clouds: Based on the keywords, nouns and adjectives as well as verbs are analyzed, and the results are turned into word clouds (see the sketch below).
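The pattern's own plotting code isn't reproduced here, but as an illustration, the transcript can be split by part of speech and rendered with the open-source nltk and wordcloud packages; these libraries, and the `transcript` variable, are assumptions rather than what the app necessarily uses:

```python
# Sketch using the open-source `nltk` and `wordcloud` packages.
import nltk
from wordcloud import WordCloud

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize(transcript)
tagged = nltk.pos_tag(tokens)

# Split tokens into nouns/adjectives and verbs by Penn Treebank tag.
nouns_adjs = " ".join(w for w, t in tagged if t.startswith(("NN", "JJ")))
verbs = " ".join(w for w, t in tagged if t.startswith("VB"))

# Render one word cloud per group.
WordCloud(width=800, height=400).generate(nouns_adjs).to_file("nouns_adjs.png")
WordCloud(width=800, height=400).generate(verbs).to_file("verbs.png")
```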
To download the insights report, click on the Print button as shown.

We learned how to extract audio from video files, transcribe the audio with our custom-built models, and process the transcript to get speaker-diarized notes as well as an NLU analysis report.
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.