This project classifies the sentiment (positive or negative) of IMDB movie reviews.
Step 1: Data Preprocessing
(a) Loading the Data
Call imdb.load_data() to load the IMDB reviews dataset.
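A minimal sketch of this step; capping the vocabulary with num_words is an assumption, since the original does not state a vocabulary size:

```python
from tensorflow.keras.datasets import imdb

vocab_size = 10000  # assumed cap: keep only the 10,000 most frequent words

# Each review comes back as a list of integer word indices;
# labels are 0 (negative) or 1 (positive).
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
```

The standard IMDB split is 25,000 training and 25,000 test reviews.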
(b) Converting the Raw Labels into Categorical Vectors
We convert the raw labels, i.e. y_train and y_test, into categorical (one-hot) vectors.
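This conversion can be sketched with Keras's to_categorical (the toy labels below are illustrative, not the real data):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

y_raw = np.array([0, 1, 1, 0])            # raw binary labels
y_cat = to_categorical(y_raw, num_classes=2)
# Each label becomes a one-hot row vector: 0 -> [1., 0.], 1 -> [0., 1.]
```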
(c) Padding the Sequences to Fixed length
Padding truncates or extends each tokenized sequence so that all sequences share a fixed length. Here we pad the integer-encoded reviews to a fixed length of 120 tokens, so that every review can be fed to the model in batches of uniform shape.
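A sketch of the padding step with Keras's pad_sequences (the two short toy sequences are illustrative):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[17, 25, 3], [8, 44, 2, 91, 6]]
# Pad (or truncate) every sequence to exactly 120 integers.
# By default, zeros are prepended to shorter sequences.
padded = pad_sequences(sequences, maxlen=120)
```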
Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
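As a sketch, word-level tokenization with the Keras Tokenizer looks like this (note this is illustrative only; imdb.load_data() already returns integer-encoded reviews, so this exact tool is not necessarily used in the project):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the movie was great", "the movie was terrible"]

# Build a word -> integer index from the corpus (most frequent words first),
# then map each text to its sequence of word ids.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
```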
Step 2: Defining and Compiling the Model
Define the hyperparameters for our LSTM model and compile it.
Categorical Cross-Entropy loss
Also called Softmax Loss: a Softmax activation followed by a Cross-Entropy loss. With this loss, the network is trained to output a probability distribution over the C classes for each input. It is used for multi-class classification.
In the specific (and usual) case of multi-class classification the labels are one-hot, so only the positive class C_p keeps its term in the loss. There is only one element of the target vector t which is non-zero, t_i = t_p = 1. So, discarding the elements of the summation which are zero due to the target labels, we can write:

CE = -∑_{i=1}^{C} t_i log(s_i) = -log(s_p)

where s_p is the predicted probability for the positive class.
Adam Optimizer
Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models. Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
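A minimal sketch of Step 2, combining the categorical cross-entropy loss and the Adam optimizer described above. The embedding size, LSTM units, and vocabulary size below are assumptions; the original does not list exact hyperparameter values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Assumed hyperparameters (not stated in the original)
vocab_size = 10000
embed_dim = 64
lstm_units = 64

model = Sequential([
    Embedding(vocab_size, embed_dim),     # word ids -> dense vectors
    LSTM(lstm_units),                     # sequence -> single hidden state
    Dense(2, activation="softmax"),       # 2 classes: negative / positive
])

# Categorical cross-entropy pairs with the one-hot labels from Step 1.
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
```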
Step 3: Training the Model
Now train the model on the training set with a batch size of 1000 samples.
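A runnable sketch of the training call. Tiny random stand-in data is used here so the snippet executes on its own; in the project, the padded IMDB sequences and one-hot labels from Step 1 would be passed instead, and the epoch count is an assumption:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical

# Stand-in data shaped like the real inputs: 2000 reviews of 120 word ids each.
x_train = np.random.randint(1, 1000, size=(2000, 120))
y_train = to_categorical(np.random.randint(0, 2, size=2000), num_classes=2)

model = Sequential([Embedding(1000, 32), LSTM(32), Dense(2, activation="softmax")])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# batch_size=1000 as stated in the text; epochs=1 is assumed for the sketch.
history = model.fit(x_train, y_train, batch_size=1000, epochs=1, verbose=0)
```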
References