Sentiment Analysis with Pytorch — Part 4 — LSTM\BiLSTM Model

Gal Hever
8 min read · Apr 11, 2020


Introduction

This post is the fourth part of the series Sentiment Analysis with Pytorch. In the previous parts we learned how to work with TorchText and built Linear and CNN models. The full code of this tutorial is available here.

In this blog-post we will focus on modeling and training LSTM\BiLSTM architectures with Pytorch.

If you wish to continue to the next part in the series:

Sentiment Analysis with Pytorch — Part 5 — MLP Model

What is LSTM Model?

Long Short-Term Memory (LSTM) is a kind of RNN model that deals with the vanishing gradient problem. It learns to keep the relevant content of the sentence and to forget the non-relevant parts based on training. The model preserves gradients over time using memory cells whose content is controlled by dynamic gates. At each input step, a gate can erase, write and read information from the memory cell. Gate values are computed from linear combinations of the current input and the previous state.

The hidden state acts as the neural network's memory: it holds information on data the network has seen before.

The operations on this information are controlled by three corresponding gates:

Forget gate: Controls which content to keep and which should be forgotten from prior steps.

Input Gate: Controls which information from the current step is relevant to add to the next steps.

Output Gate: Controls what should be the next hidden state, i.e. the output of the current step.

What is BiLSTM Model?

A Bidirectional LSTM (BiLSTM) model maintains two separate states for forward and backward inputs, generated by two different LSTMs. The first LSTM processes a regular sequence that starts from the beginning of the sentence, while the second LSTM is fed the input sequence in the opposite order. The idea behind a bidirectional network is to capture information from the surrounding inputs. It usually learns faster than a one-directional approach, although this depends on the task.

For more information on LSTM, I recommend continuing with this blog-post.

Building an LSTM\BiLSTM Model

Let’s code!

First, let’s define the hyper-parameters for the LSTM model:

lr = 1e-4
batch_size = 50
dropout_keep_prob = 0.5
embedding_size = 300
max_document_length = 100 # each sentence has up to 100 words
dev_size = 0.8 # train\validation split ratio
max_size = 5000 # maximum vocabulary size
seed = 1
num_classes = 3
num_hidden_nodes = 93
hidden_dim2 = 128
num_layers = 2 # LSTM layers
bi_directional = False
num_epochs = 7

LSTM Class

In this tutorial we will go over the LSTM layers and how they work. Our architecture will contain an LSTM (or BiLSTM) layer with 93 units, followed by one fully-connected layer with 128 units and a 0.5 dropout rate.

Constructor

We will define all of the attributes of the LSTM class in __init__, and then define the forward pass in the forward function. In Sentiment Analysis with Pytorch — Part 2 — Linear Model, we explained in detail the general structure of the classes and the attribute inheritance from nn.Module. We also reviewed in depth, in Sentiment Analysis with Pytorch — Part 3 — CNN Model, the differences between the layers and the dimensions.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class LSTM(nn.Module):

    # define all the layers used in the model
    def __init__(self, vocab_size, embedding_dim, lstm_units, hidden_dim, num_classes, lstm_layers,
                 bidirectional, dropout, pad_index, batch_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.lstm = nn.LSTM(embedding_dim,
                            lstm_units,
                            num_layers=lstm_layers,
                            bidirectional=bidirectional,
                            batch_first=True)
        num_directions = 2 if bidirectional else 1
        self.fc1 = nn.Linear(lstm_units * num_directions, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.lstm_layers = lstm_layers
        self.num_directions = num_directions
        self.lstm_units = lstm_units

    def init_hidden(self, batch_size):
        # initial hidden and cell states: one per layer and direction, all zeros
        h, c = (torch.zeros(self.lstm_layers * self.num_directions, batch_size, self.lstm_units),
                torch.zeros(self.lstm_layers * self.num_directions, batch_size, self.lstm_units))
        return h, c

    def forward(self, text, text_lengths):
        batch_size = text.shape[0]
        h_0, c_0 = self.init_hidden(batch_size)

        embedded = self.embedding(text)  # [batch_size, sentence_length, embedding_dim]
        packed_embedded = pack_padded_sequence(embedded, text_lengths, batch_first=True)
        output, (h_n, c_n) = self.lstm(packed_embedded, (h_0, c_0))
        output_unpacked, output_lengths = pad_packed_sequence(output, batch_first=True)
        out = output_unpacked[:, -1, :]  # hidden state of the last time step
        rel = self.relu(out)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        preds = self.fc2(drop)
        return preds

Pack_padded_sequence \ Pad_packed_sequence Functions

The pack_padded_sequence function converts the batch into a format that enables the model to ignore the padded elements. The LSTM itself does not distinguish between padded elements and regular elements, but with a packed input it does not perform the calculations (and the corresponding gradient calculations in the backpropagation step) for the padded values. When we feed the model a packed input it becomes dynamic and saves unnecessary calculations. The pad_packed_sequence function is the reverse operation of pack_padded_sequence and brings the output back to the familiar format [batch_size, sentence_length, hidden_features].

If you want to read more about it, you can do so at this link.
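Here is a minimal toy example of the two functions (shapes only, independent of our model):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy batch: 3 sequences padded to length 5, with 4 features per element
embedded = torch.randn(3, 5, 4)    # [batch_size, sentence_length, embedding_dim]
lengths = torch.tensor([5, 3, 2])  # true (unpadded) lengths, sorted in descending order

packed = pack_padded_sequence(embedded, lengths, batch_first=True)
# packed.data holds only the 5 + 3 + 2 = 10 real time steps, so the LSTM skips the padding

lstm = torch.nn.LSTM(input_size=4, hidden_size=6, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

unpacked, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(unpacked.shape)  # torch.Size([3, 5, 6]) -> back to [batch_size, sentence_length, hidden_features]
print(out_lengths)     # tensor([5, 3, 2])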

Init_hidden Function

At the beginning of each batch we need to initialize the hidden and cell states to zero and feed them to the LSTM layer, so we define a function that does this for us for each batch separately.

LSTM Layer

Pytorch’s nn.LSTM expects a 3D tensor as input: [batch_size, sentence_length, embedding_dim].

For each word in the sentence, each layer computes the input gate i, the forget gate f, the output gate o and the new cell content c’ (the new content that should be written to the cell). It also computes the current cell state and the hidden state.
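As a rough sketch, a single LSTM time step computes the following (the weight shapes follow Pytorch’s convention, but the function itself is illustrative and not part of the tutorial code):

import torch

def lstm_step(x_t, h_prev, c_prev, W_ih, W_hh, b_ih, b_hh):
    # One LSTM time step, written out explicitly for illustration.
    # W_ih: [4*hidden, input_size], W_hh: [4*hidden, hidden], biases: [4*hidden]
    gates = x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh
    i, f, g, o = gates.chunk(4, dim=-1)      # input, forget, cell candidate, output
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                        # new cell content c'
    c_t = f * c_prev + i * g                 # update the memory cell
    h_t = o * torch.tanh(c_t)                # new hidden state
    return h_t, c_t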

Parameters for the LSTM layer (a short sketch tying these arguments together follows the list):

Input_size: The number of features of each element in the input. In our case, each element (word) has 300 features, which corresponds to embedding_dim.

Hidden_size: This variable defines the number of LSTM hidden units.

Num_layers: For multi-layer LSTMs, this argument defines the number of stacked LSTM layers in the model. In our case, for example, we set lstm_layers=2, which means that the input x at time t of the second layer is the hidden state h at time t of the first layer, with dropout applied to it (if dropout is used).

Batch_first: The nn.LSTM layer expects the batch dimension to come first in the input, as [batch_size, sentence_length, embedding_dim]; this is enabled by passing batch_first=True.

Dropout: If this argument is greater than zero, a Dropout layer with this dropout probability is applied to the output of each LSTM layer except the last one.

Bidirectional: By setting the bidirectional argument we can control the model type (False = LSTM, True = BiLSTM).
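To make the arguments concrete, here is a minimal sketch (not part of the tutorial code) that instantiates the layer with the hyper-parameters defined above:

import torch.nn as nn

lstm = nn.LSTM(input_size=embedding_size,     # 300 features per word
               hidden_size=num_hidden_nodes,  # 93 LSTM units
               num_layers=num_layers,         # 2 stacked LSTM layers
               bidirectional=bi_directional,  # False = LSTM, True = BiLSTM
               dropout=0.0,                   # dropout between stacked layers (our class does not pass it to nn.LSTM)
               batch_first=True)              # input is [batch_size, sentence_length, embedding_dim]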

The inputs and outputs of the LSTM layer are described below.

The input of the LSTM Layer:

Input: In our case it is the packed input, but it can also be the original padded sequence, where each Xi represents a word in the sentence (including padding elements).

h_0: The initial hidden state that we feed to the model.

c_0: The initial cell state that we feed to the model.

The output of the LSTM Layer:

Output: The first value returned by the LSTM contains the hidden states of the top layer for every time step in the sequence.

h_n: The second output is the last hidden state of each LSTM layer (one per layer and direction).

c_n: The third output is the last cell state of each LSTM layer (one per layer and direction).

To get the hidden state of the last time step we use the output_unpacked[:, -1, :] command, and we feed it to the next fully-connected layer.
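As a quick sanity check (a standalone sketch with the dimensions used in this tutorial, not part of the model code), the shapes of these outputs look like this:

import torch
import torch.nn as nn

batch_size, sentence_length = 50, 100
x = torch.randn(batch_size, sentence_length, 300)  # a batch of already-embedded sentences

lstm = nn.LSTM(input_size=300, hidden_size=93, num_layers=2,
               bidirectional=False, batch_first=True)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # [50, 100, 93] -> hidden state of the top layer at every time step
print(h_n.shape)     # [2, 50, 93]   -> last hidden state of each layer (num_layers * num_directions)
print(c_n.shape)     # [2, 50, 93]   -> last cell state of each layer

last_step = output[:, -1, :]  # [50, 93] -> hidden state of the last time step, fed to the FC layer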

Another Way to Build LSTM Class

We will now show another way to build the LSTM class, without using the unpacking function.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim1, hidden_dim2, output_dim, n_layers,
                 bidirectional, dropout, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim1,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            batch_first=True)
        # hidden_dim1 * 2 because the forward and backward states are concatenated (assumes bidirectional=True)
        self.fc1 = nn.Linear(hidden_dim1 * 2, hidden_dim2)
        self.fc2 = nn.Linear(hidden_dim2, output_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        embedded = self.embedding(text)
        packed_embedded = pack_padded_sequence(embedded, text_lengths, batch_first=True)

        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        # concatenate the final forward (hidden[-2]) and backward (hidden[-1]) hidden states
        cat = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        rel = self.relu(cat)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        preds = self.fc2(drop)
        return preds

This does the same thing that we explained before, but instead of using the pad_packed_sequence function we use the h_n variable, which keeps one vector for each direction of the top layer: hidden[-2, :, :] takes the final hidden state of the forward network (the second row from the end), and hidden[-1, :, :] takes the final hidden state of the backward network (the last row).
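A quick way to convince yourself of this ordering is to reshape h_n into separate layer and direction dimensions, as the Pytorch documentation suggests (a standalone check, not part of the tutorial code):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=93, num_layers=2,
               bidirectional=True, batch_first=True)
x = torch.randn(50, 100, 300)
_, (h_n, _) = lstm(x)

print(h_n.shape)                 # [4, 50, 93] = [num_layers * num_directions, batch, hidden]
h_view = h_n.view(2, 2, 50, 93)  # [num_layers, num_directions, batch, hidden]

# Top layer, forward direction == hidden[-2]; top layer, backward direction == hidden[-1]
print(torch.equal(h_view[-1, 0], h_n[-2]))  # True
print(torch.equal(h_view[-1, 1], h_n[-1]))  # True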

Cat Function

torch.cat((t1, t2), dim=d) concatenates the tensors along dimension d. Here the outputs of the two directions of the LSTM are concatenated on the hidden-features dimension (dim=1, the last dimension of each state).

The forward network contains information about previous inputs and the backward network contains information about the following inputs, while the final state is a combination of both. We take the last hidden state of the forward pass and the last hidden state of the backward pass and merge them.

There are a few other options for merging the forward and backward states that can be used instead of concatenation, such as sum, multiplication or average. The difference is that concatenation joins the final forward and backward states (so the dimension increases), while the others perform an element-wise operation that keeps the original dimensions. Concatenation is usually more common because it keeps information that we lose with the other options.
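For example (a toy sketch with the dimensions used in this tutorial):

import torch

hidden_fwd = torch.randn(50, 93)  # last forward hidden state  [batch, hidden]
hidden_bwd = torch.randn(50, 93)  # last backward hidden state [batch, hidden]

merged_cat = torch.cat((hidden_fwd, hidden_bwd), dim=1)  # [50, 186] -> the dimension doubles
merged_sum = hidden_fwd + hidden_bwd                     # [50, 93]  -> the dimension is preserved
merged_avg = (hidden_fwd + hidden_bwd) / 2               # [50, 93]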

Training, Evaluation and Test

The training, evaluation and test procedures are exactly the same for all of the models; we explained them in detail in the previous posts.

Main Function

import os
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
path = 'C:/Users/Gal/PycharmProjects/Sentiment_Analyzer'
path_data = os.path.join(path, "data")

# parameters
model_type = "LSTM"
data_type = "token"  # or: "morph"
to_train = True      # set to False to skip training and only evaluate saved weights

char_based = True
if char_based:
    tokenizer = lambda s: list(s)    # char-based
else:
    tokenizer = lambda s: s.split()  # word-based

# Text, Label, train_data, valid_data and test_data are the torchtext fields and datasets built in Part 1
Text.build_vocab(train_data, max_size=max_size)
Label.build_vocab(train_data)
vocab_size = len(Text.vocab)
pad_index = Text.vocab.stoi[Text.pad_token]  # index of the padding token for the embedding layer

train_iterator, valid_iterator, test_iterator = create_iterator(train_data, valid_data, test_data, batch_size, device)

# loss function
loss_func = nn.CrossEntropyLoss()

# model (arguments match the LSTM class constructor defined above)
lstm_model = LSTM(vocab_size, embedding_size, num_hidden_nodes, hidden_dim2, num_classes, num_layers,
                  bi_directional, dropout_keep_prob, pad_index, batch_size)

# optimization algorithm
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=lr)

# train and evaluation
if to_train:
    run_train(num_epochs, lstm_model, train_iterator, valid_iterator, optimizer, loss_func, model_type)

# load weights
lstm_model.load_state_dict(torch.load(os.path.join(path, "saved_weights_LSTM.pt")))

# predict
test_loss, test_acc = evaluate(lstm_model, test_iterator, loss_func)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc * 100:.2f}%')

End Notes

In this post we built LSTM and BiLSTM models with Pytorch. In the next part we will learn how to build an MLP for the Sentiment Analysis task with Pytorch. If you wish to continue to the next part in the series, here is the link: Sentiment Analysis with Pytorch — Part 5 — MLP Model.

You can find the full code for this tutorial on Github.

References

https://www.aclweb.org/anthology/C18-1190.pdf

https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm

Sentiment Analysis with Pytorch — Part 1 — Data Preprocessing

Sentiment Analysis with Pytorch — Part 2 — Linear Model

Sentiment Analysis with Pytorch — Part 3 — CNN Model
