Acoustic sentiment analysis for emotion classification
I. Aim
To analyse sentiment in speech based on acoustic features and to classify each sample into one of 10 classes.
II. Classes
1) Female angry
2) Female calm
3) Female fearful
4) Female happy
5) Female sad
6) Male angry
7) Male calm
8) Male fearful
9) Male happy
10) Male sad
III. Dataset
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). Of its speech and song subsets, we have used only the speech subset. The database contains 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. We have used the audio-only (16-bit, 48 kHz) data. The speech archive (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440. Of the 8 emotions, we have chosen calm, happy, sad, angry, and fearful for classification, i.e. 960 samples (192 files per emotion x 5 emotions).
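Each RAVDESS filename encodes its metadata as seven hyphen-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor), with odd-numbered actors male and even-numbered female. Below is a minimal sketch of how the 960 files could be selected and labelled into the 10 classes, assuming that standard naming convention; the helper names are ours, not part of the dataset.

import os

# RAVDESS filename fields: modality-vocalchannel-emotion-intensity-
# statement-repetition-actor, e.g. "03-01-02-01-01-01-12.wav"
EMOTIONS = {"02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful"}  # the 5 emotions we keep

def label_from_filename(filename):
    """Return one of the 10 class labels, or None if the emotion is unused."""
    parts = os.path.splitext(filename)[0].split("-")
    emotion_code, actor_id = parts[2], int(parts[6])
    if emotion_code not in EMOTIONS:
        return None  # skip neutral, surprise, and disgust
    gender = "male" if actor_id % 2 == 1 else "female"  # odd actor IDs are male
    return f"{gender}_{EMOTIONS[emotion_code]}"

def collect_dataset(root_dir):
    """Walk the extracted archive and gather (path, label) pairs (960 expected)."""
    samples = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(".wav"):
                label = label_from_filename(name)
                if label is not None:
                    samples.append((os.path.join(dirpath, name), label))
    return samples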
IV. Speech file loading parameters
Sampling rate : 44.1 kHz
Speech file duration : 2.5 seconds
Hop length : 512 samples
Number of frames : 1 + floor((44100 * 2.5) / 512) = 216 frames (librosa's centred framing)
We experimented with different hop lengths and sampling rates; the values above gave the best accuracy.
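A minimal loading sketch with these parameters, using librosa (the file path is hypothetical, for illustration only):

import librosa

SR = 44100        # target sampling rate (Hz); librosa resamples the 48 kHz source
DURATION = 2.5    # seconds kept per clip
HOP_LENGTH = 512  # samples between successive frames

# hypothetical file path
signal, sr = librosa.load("Actor_01/03-01-02-01-01-01-01.wav",
                          sr=SR, duration=DURATION)

# Centred framing yields 1 + floor(n_samples / hop_length) frames:
# 1 + 110250 // 512 = 216 for a full 2.5 s clip
n_frames = 1 + len(signal) // HOP_LENGTH
print(n_frames)  # 216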
V. Requirements
python | tensorflow | librosa | matplotlib | keras | sklearn
VI. Feature
MFCC : Mel-frequency cepstral coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. The key insight is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out; if we can determine the shape accurately, we get an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.
Steps to find MFCC:
1) Frame the signal into short overlapping windows.
2) For each frame, compute the discrete Fourier transform and take the power spectrum.
3) Apply a mel-scaled filterbank to the power spectrum and sum the energy in each filter.
4) Take the logarithm of the filterbank energies.
5) Apply the discrete cosine transform (DCT) to the log energies and keep the lowest coefficients; these are the MFCCs.
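librosa implements this whole pipeline in one call. A minimal extraction sketch, assuming the loading parameters from Section IV; the zero-padding of short clips and the per-coefficient mean removal are our assumptions for producing a uniform, scaled (13, 216) matrix, and the project's exact scaling may differ.

import librosa
import numpy as np

def extract_mfcc(path, sr=44100, duration=2.5, n_mfcc=13, hop_length=512):
    """Load a clip and return a scaled MFCC matrix of shape (13, 216)."""
    signal, _ = librosa.load(path, sr=sr, duration=duration)
    # pad short clips so every sample yields the same number of frames
    target_len = int(sr * duration)
    if len(signal) < target_len:
        signal = np.pad(signal, (0, target_len - len(signal)))
    mfcc = librosa.feature.mfcc(y=signal, sr=sr,
                                n_mfcc=n_mfcc, hop_length=hop_length)
    # scale each coefficient to zero mean across frames
    return mfcc - mfcc.mean(axis=1, keepdims=True)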
VII. Exploratory Data Analysis (EDA)
1) Waveform plot for a speech sample (see the plotting sketch at the end of this section)
2) Scaled MFCC (13 x 216)
Number of MFCC coefficients (n_mfcc) = 13
Number of frames = 216
3) CNN result
We also tested the MFCC features with MLP and LSTM models, but the CNN outperformed both (an illustrative model sketch follows).
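A minimal Keras sketch of a CNN over the (13, 216) MFCC input with 10 output classes; the layer sizes and hyperparameters here are illustrative assumptions, not the project's recorded configuration.

from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(13, 216, 1), num_classes=10):
    """Small 2-D CNN over the MFCC matrix (treated as a 1-channel image)."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # categorical_crossentropy assumes one-hot encoded labels
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model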
The flowchart below shows the overall flow of the EDA and the use cases presented to the customer.
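For EDA items 1 and 2 above, a minimal plotting sketch (the file path is hypothetical; waveshow assumes librosa >= 0.9, where it replaced the older waveplot):

import librosa
import librosa.display
import matplotlib.pyplot as plt

# hypothetical file path
signal, sr = librosa.load("Actor_01/03-01-05-02-01-01-01.wav",
                          sr=44100, duration=2.5)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, hop_length=512)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(signal, sr=sr, ax=ax1)  # item 1: waveform plot
ax1.set_title("Waveform")
img = librosa.display.specshow(mfcc, sr=sr, hop_length=512,
                               x_axis="time", ax=ax2)  # item 2: MFCC heatmap
ax2.set_title("MFCC (13 x 216)")
fig.colorbar(img, ax=ax2)
plt.tight_layout()
plt.show()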